Edit report at https://bugs.php.net/bug.php?id=62341&edit=1
 ID:                 62341
 Updated by:         ras...@php.net
 Reported by:        bfanger at gmail dot com
 Summary:            htmlspecialchars() should work on ascii compatible encodings by default.
 Status:             Open
 Type:               Feature/Change Request
 Package:            *Unicode Issues
 PHP Version:        5.4.4
 Block user comment: N
 Private report:     N

 New Comment:

EUC-JP is heavily used, is supported by htmlspecialchars(), and is not ASCII compatible.

Previous Comments:
------------------------------------------------------------------------
[2012-06-17 17:47:34] bfanger at gmail dot com

Rereading the manpage more thoroughly, all the info is there.
Another nice resource is http://nikic.github.com/2012/01/28/htmlspecialchars-improvements-in-PHP-5-4.html

I now disagree with the decision to return an empty string; with PHP's flexible typing this should have been false or null.

PHP 5.4 no longer has the weird 'only errors when "display_errors" is off' behavior, but sadly the chosen behaviour is to always silently suppress those errors. If throwing E_WARNING is too risky, an E_ENCODING error level would be a very welcome addition.

ENT_IGNORE: Removes special characters from the string instead of ignoring them. (My previous statement "unless ENT_IGNORE is passed." is therefore invalid.)

Using strtr($text, array('<' => '&lt;', '>' => '&gt;', '&' => '&amp;')); is 35% slower than htmlspecialchars($text, ENT_NOQUOTES, 'ISO-8859-1'), which has the same output.

The security risk applies only to multibyte encodings that always use 2 or more bytes per character, like UTF-16 (but UTF-16 and UTF-32 aren't supported by htmlspecialchars(); I'm not sure whether any of the supported charsets is incompatible with ASCII).

My framework uses UTF-8 95% of the time, but to prevent silent truncating I'll have to add 'ISO-8859-1' as the encoding. It just feels wrong.

The default charset for htmlspecialchars() should be "ASCII compatible": "the encodings ISO-8859-1, ISO-8859-15, UTF-8, cp866, cp1251, cp1252, and KOI8-R are effectively equivalent", no ifs, no buts.

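For reference, the equivalence claimed in the comment above can be checked with a minimal sketch like the one below. The sample string is a made-up input (ISO-8859-1 bytes, with é as the single byte 0xE9) and the variable names are only illustrative. It compares output only and says nothing about the 35% timing figure.

```php
<?php
// Hypothetical sample input: "café <b> & co" encoded as ISO-8859-1.
$text = "caf\xE9 <b> & co";

// Manual replacement of the three characters that ENT_NOQUOTES escapes.
$viaStrtr = strtr($text, array('<' => '&lt;', '>' => '&gt;', '&' => '&amp;'));

// Explicit single-byte encoding: every byte is a valid character,
// so the input is never rejected as an invalid multibyte sequence.
$viaHtmlspecialchars = htmlspecialchars($text, ENT_NOQUOTES, 'ISO-8859-1');

var_dump($viaStrtr === $viaHtmlspecialchars); // bool(true)
```

Passing 'ISO-8859-1' explicitly is the workaround the comment describes. It leaves bytes above 0x7F untouched, so it only suits input in an ASCII compatible encoding.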
------------------------------------------------------------------------
[2012-06-17 10:25:58] bfanger at gmail dot com

Updated summary to "Secure behavior htmlspecialchars() not reflected in the documentation"

My initial change request "htmlspecialchars() should work on ascii compatible encodings by default" no longer applies. After some research I agree with the new behavior.

------------------------------------------------------------------------
[2012-06-17 10:06:34] bfanger at gmail dot com

Description:
------------
In PHP 5.4 the default encoding for htmlentities() was changed to 'UTF-8'. When an ISO-8859-1 encoded string with a special character is passed to htmlspecialchars(), it returns an empty string (invalid multibyte sequence).

This is the new intended (and more secure) behavior, and I agree, but...

The old default (ISO-8859-1) worked on UTF-8, ISO-8859-1 and other ASCII compatible encodings, which is reflected in the documentation:

"Calling htmlspecialchars() is sufficient if the encoding supports all characters in the input string (such as UTF-8 but also ISO-8859-1 on ISO-8859-1 only input). htmlentities() needs to be called only if the output encoding doesn't support all characters in the input string."

This is no longer the case, unless ENT_IGNORE is passed.

Solution: Drop the paragraph from the documentation.

PS: You might want to add a paragraph stating that incorrectly encoded text will cause htmlspecialchars() to return an empty string.

------------------------------------------------------------------------
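The behavior described in this report can be reproduced with a short sketch along these lines. The input string is an invented example, and escape_utf8_strict() is a hypothetical helper, not part of PHP. It only illustrates one way to turn the silent empty-string result into an explicit error.

```php
<?php
// ISO-8859-1 bytes: 0xE9 followed by a space is not a valid UTF-8 sequence.
$latin1 = "caf\xE9 <b>";

// With the encoding set to UTF-8, the invalid input yields an empty string.
$strict = htmlspecialchars($latin1, ENT_NOQUOTES, 'UTF-8');

// ENT_IGNORE discards the invalid bytes instead of rejecting the whole string.
$ignored = htmlspecialchars($latin1, ENT_NOQUOTES | ENT_IGNORE, 'UTF-8');

var_dump($strict === '');                          // bool(true)
var_dump(strpos($ignored, '&lt;b&gt;') !== false); // bool(true)

// Hypothetical helper (an assumption, not a PHP function): escape as UTF-8,
// but fail loudly instead of silently returning '' for invalid input.
function escape_utf8_strict($text)
{
    $escaped = htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
    if ($escaped === '' && $text !== '') {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $escaped;
}
```

Calling escape_utf8_strict("caf\xC3\xA9 <b>") with valid UTF-8 returns the escaped string, while the ISO-8859-1 input above raises the exception rather than producing empty output.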