ID: 17008
Updated by: [EMAIL PROTECTED]
Reported By: [EMAIL PROTECTED]
-Status: Open
+Status: Analyzed
Bug Type: *General Issues
Operating System: WinXP / Apache 1.3.24
PHP Version: 4.2.0
New Comment:
The code in ext/standard/html.c seems to support only entities
whose characters fall within the 8-bit range (0-255) of a given
charset, and this applies to UTF-8 as well. Windows code page 1252
is the only character set that has em and en dashes in this 8-bit
range, so it is the only one that will work the way you expect.
In other words, you need to pass "cp1252" as the third argument
to htmlentities() and make sure that your input string is encoded
in cp1252 as well.
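A minimal sketch of that workaround, assuming the input really is
cp1252-encoded. The variable names and sample string are illustrative
and not taken from the original report:

```php
<?php
// Hypothetical cp1252 input: byte 0x96 is the en dash, 0x97 the em dash.
$text = "pages 10" . "\x96" . "20 " . "\x97" . " see also";

// The third argument tells htmlentities() which charset the input uses.
$encoded = htmlentities($text, ENT_COMPAT, "cp1252");

echo $encoded, "\n";
```

If the input is actually UTF-8, the dashes are multi-byte sequences and
this call will not see them as cp1252 dashes, so the charset argument
has to match the real encoding of the string.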
Support for full UTF-8 entities might come in a future release.
In the meantime, you can convert UTF-8 text to HTML numeric
character references with PHP's mbstring extension and this
piece of code:
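// Each group of four values is: first code point, last code point,
// offset, mask. Code points inside a range are emitted as &#NNNN;.
$f = 0xffff;
$convmap = array(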
/* <!ENTITY % HTMLlat1 PUBLIC
"-//W3C//ENTITIES Latin 1//EN//HTML"> %HTMLlat1; */
160, 255, 0, $f,
/* <!ENTITY % HTMLsymbol PUBLIC
"-//W3C//ENTITIES Symbols//EN//HTML"> %HTMLsymbol; */
402, 402, 0, $f, 913, 929, 0, $f, 931, 937, 0, $f,
945, 969, 0, $f, 977, 978, 0, $f, 982, 982, 0, $f,
8226, 8226, 0, $f, 8230, 8230, 0, $f, 8242, 8243, 0, $f,
8254, 8254, 0, $f, 8260, 8260, 0, $f, 8465, 8465, 0, $f,
8472, 8472, 0, $f, 8476, 8476, 0, $f, 8482, 8482, 0, $f,
8501, 8501, 0, $f, 8592, 8596, 0, $f, 8629, 8629, 0, $f,
8656, 8660, 0, $f, 8704, 8704, 0, $f, 8706, 8707, 0, $f,
8709, 8709, 0, $f, 8711, 8713, 0, $f, 8715, 8715, 0, $f,
8719, 8719, 0, $f, 8721, 8722, 0, $f, 8727, 8727, 0, $f,
8730, 8730, 0, $f, 8733, 8734, 0, $f, 8736, 8736, 0, $f,
8743, 8747, 0, $f, 8756, 8756, 0, $f, 8764, 8764, 0, $f,
8773, 8773, 0, $f, 8776, 8776, 0, $f, 8800, 8801, 0, $f,
8804, 8805, 0, $f, 8834, 8836, 0, $f, 8838, 8839, 0, $f,
8853, 8853, 0, $f, 8855, 8855, 0, $f, 8869, 8869, 0, $f,
8901, 8901, 0, $f, 8968, 8971, 0, $f, 9001, 9002, 0, $f,
9674, 9674, 0, $f, 9824, 9824, 0, $f, 9827, 9827, 0, $f,
9829, 9830, 0, $f,
/* <!ENTITY % HTMLspecial PUBLIC
"-//W3C//ENTITIES Special//EN//HTML"> %HTMLspecial; */
/* These ones are excluded to enable HTML: 34, 38, 60, 62 */
338, 339, 0, $f, 352, 353, 0, $f, 376, 376, 0, $f,
710, 710, 0, $f, 732, 732, 0, $f, 8194, 8195, 0, $f,
8201, 8201, 0, $f, 8204, 8207, 0, $f, 8211, 8212, 0, $f,
8216, 8218, 0, $f, 8220, 8222, 0, $f,
8224, 8225, 0, $f, 8240, 8240, 0, $f, 8249, 8250, 0, $f,
8364, 8364, 0, $f);
echo mb_encode_numericentity($html, $convmap, "UTF-8");
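To see how a conversion map behaves, here is a small sketch that uses a
cut-down map covering only the en and em dash. It assumes the mbstring
extension is loaded; $dashmap and $sample are hypothetical names used
for illustration only:

```php
<?php
// Illustrative map covering only U+2013 (en dash) and U+2014 (em dash).
// Each group of four is: first code point, last code point, offset, mask.
$dashmap = array(8211, 8212, 0, 0xffff);

// Hypothetical UTF-8 input: E2 80 93 is an en dash, E2 80 94 an em dash.
$sample = "1999" . "\xE2\x80\x93" . "2002 " . "\xE2\x80\x94" . " <b>bold</b>";

// Only code points inside the mapped ranges become numeric references;
// everything else, including the markup, passes through untouched.
$out = mb_encode_numericentity($sample, $dashmap, "UTF-8");

echo $out, "\n"; // prints: 1999&#8211;2002 &#8212; <b>bold</b>
```

Because 34, 38, 60 and 62 are left out of the map, existing HTML tags
and entities in the input survive the conversion.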
Previous Comments:
------------------------------------------------------------------------
[2002-05-04 23:10:19] [EMAIL PROTECTED]
If I'm not wrong, this function is supposed to encode all those special
characters, right? Well, em and en dashes are not encoded. The whole
list of characters that should be encoded can be found here:
http://selfhtml.teamone.de/html/referenz/zeichen.htm#benannte_interpunktion
It's in German, but I guess you can see what I mean.
------------------------------------------------------------------------