--On Sunday, March 24, 2002 21:57:54 +0000 [EMAIL PROTECTED] wrote:

> dougm       02/03/24 13:57:53
>
>   Modified:    .        Changes STATUS
>                src/modules/perl Util.xs
>                t/net/perl util.pl
>   Log:
>   Submitted by: Geoff Young <[EMAIL PROTECTED]>
>   Reviewed by:  dougm
>   properly escape highbit chars in Apache::Utils::escape_html

This is uncool for those of us using a non-ASCII encoding and sending
out lots of characters with the 8th bit set. In a French page, for
example, many accented characters will be replaced by 6-byte sequences.
If I'm sending out "Content-type: text/html; charset=ISO-8859-1" and
calling escape_html to escape '<', '>' and the like, I'm going to be
serving quite a lot more bytes than before this patch.

However, escape_html() has no clue as to what the character set is, or
whether it has been correctly specified in the Content-Type header. It
has also been mentioned here that escape_html is only valid for
single-byte encodings.

So this patch does the right thing by escaping the odd 8-bit char in
mostly ASCII output, but users of other charsets should be warned not
to use it. I use HTML::Entities::encode($_[0], '<>&"') myself.

Therefore I propose a doc patch to clear this up:

Index: Util.pm
===================================================================
RCS file: /home/cvs/modperl/Util/Util.pm,v
retrieving revision 1.8
diff -u -r1.8 Util.pm
--- Util.pm     4 Mar 2000 20:55:47 -0000       1.8
+++ Util.pm     25 Mar 2002 18:19:37 -0000
@@ -68,6 +68,13 @@
 
     my $esc = Apache::Util::escape_html($html);
 
+This function is unaware of its argument's character set and encoding.
+It assumes a single-byte encoding and escapes all characters with the
+8th bit set. Do not use it with multi-byte encodings such as utf8.
+When using a single-byte non-ASCII encoding such as ISO-8859-1,
+consider specifying the character set in the Content-Type header,
+and using HTML::Entities to avoid unnecessary escaping.
+
 =item escape_uri
 
 This function replaces all unsafe characters in the $string with their

-- 
Eric Cholet
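
P.S. To put rough numbers on the size concern, here is a minimal sketch
in plain Perl. Both helper subs and the sample string are hypothetical
stand-ins written for this illustration; they are not the Apache::Util
or HTML::Entities implementations. The sketch assumes the input is a
single-byte (ISO-8859-1) string and that high-bit bytes are turned into
numeric character references.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only; neither is the actual
# Apache::Util or HTML::Entities code.
my %markup = (
    '<' => '&lt;',
    '>' => '&gt;',
    '&' => '&amp;',
    '"' => '&quot;',
);

# Escape only the markup-significant characters.
sub escape_markup_only {
    my ($s) = @_;
    $s =~ s/([<>&"])/$markup{$1}/g;
    return $s;
}

# Also turn every byte with the 8th bit set into a numeric reference.
sub escape_markup_and_highbit {
    my ($s) = @_;
    $s = escape_markup_only($s);
    $s =~ s/([\x80-\xFF])/sprintf('&#%d;', ord $1)/ge;
    return $s;
}

# Sample ISO-8859-1 bytes: two accented characters plus a tag.
my $latin1 = "d\xE9j\xE0 vu <b>";

printf "%d bytes in, %d markup-only, %d with high-bit escaping\n",
    length $latin1,
    length escape_markup_only($latin1),
    length escape_markup_and_highbit($latin1);
```

Each accented byte grows from one byte to six in the second variant,
so the overhead scales with how much non-ASCII text the page carries.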