Stas and all of the others,

Stas said:
>I think I got your problem solved, you need to:

>- print $q->header();
>+ print $q->header("text/html; charset=utf-8");

Well actually you did not.
Probably you looked a bit too fast.
(forgivable in view of the number of mails you reply to :-)

The utf8-test.pl code reads what comes out of the form (which has a
charset=utf-8 meta tag, so that is OK, see my previous mail).
utf8-test.pl then replaces the characters higher than 7F with character
reference entities, but with the string '+entity: ' in front of the value
(see lines 10 and 11 of utf8-test.pl below).
To double-check, the information read back from the form is also
unpacked from Unicode values into their hex counterparts.
Both strings are then printed out as normal low ASCII characters (<7F),
so there is no need to set the utf-8 flag here.

From further testing I have seen that only Unicode characters that actually
have a representation in the win1252 character set come back under their
corresponding win1252 position.
So the form would for example contain an ndash character (Unicode position
dec 8211 or U+2013).
But that is read back as character dec 150 or hex 96.
And if the form contains a right single quotation mark (Unicode position
dec 8217 or U+2019), it comes back under its win1252 position of dec 146 or
hex 92.
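
The mapping described above can be reproduced outside the form. This is only a
minimal sketch, assuming the Encode module (bundled with Perl 5.8 and later) is
installed; the helper name cp1252_byte is made up for illustration and is not
part of utf8-test.pl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the windows-1252 byte value of a Unicode
# code point. Dies (FB_CROAK) if the character has no cp1252 position.
sub cp1252_byte {
    my ($codepoint) = @_;
    my $char  = chr($codepoint);
    my $bytes = encode('cp1252', $char, Encode::FB_CROAK);
    return ord($bytes);
}

# ndash and right single quotation mark, as in the examples above
printf "U+%04X -> dec %d, hex %02x\n", $_, (cp1252_byte($_)) x 2
    for (0x2013, 0x2019);
```

This prints hex 96 for U+2013 and hex 92 for U+2019, the same values that come
back from the form.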

I would have expected if I send something in under its unicode position, it
would come back to me under its unicode position.
But then again I may be wrong.
And the utf8 flag in the header only means that it will be utf8-encoded, and
should not be confused with the character set used.
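
To keep the two notions apart, here is a small sketch of what a Unicode
position looks like once it is UTF-8 encoded. It assumes the Encode module is
available; utf8_hex is a hypothetical helper name, not something from
utf8-test.pl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 byte sequence of a code point
# as a string of space-separated hex bytes.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

print utf8_hex(0x2013), "\n";   # ndash:    e2 80 93
print utf8_hex(0x00EB), "\n";   # e-umlaut: c3 ab
```

So a UTF-8 encoded ndash would be the three bytes e2 80 93, which is neither
the single value 2013 nor the single byte 96.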

I am under the impression I am confusing myself more and more here.
So if somebody has been on this path before and knows the truth, let him
speak up!

(Oh, did I mention already that I have tested only against IE6? The
browser could be the cause of this odd(?) behaviour as well.)

Thanks all for your patience.
I would really like to get to the bottom of this.

Bart

Here is utf8-test.pl again, this time with line numbers:
1:#!/perl/bin/perl.exe
2:use strict;
3:use CGI;
4:use CGI::Carp qw(fatalsToBrowser);
5:
6:my $q = CGI->new;
7:my $content = $q->param("utf8-test");
8:$content .= "verify with \x{2014}";
9:my @content = unpack('U*', $content);
10:$content =~ s/([\x{0800}-\x{FFFF}])/sprintf('+entity:%d+',ord($1))/ge;
11:$content =~ s/([\x{0080}-\x{07FF}])/sprintf('+entity: %d+',ord($1))/ge;
12:print $q->header();
13:print $q->p($content);
14:print $q->p('hex');
15:foreach (@content) {printf "%x ", $_}

and here is the html form that triggers the utf8-test.pl:
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>

<body>
<form method="post" action="/mod_perl/utf8-test.pl"
enctype="multipart/form-data">
<textarea name ='utf8-test' cols='60'>test: &#235; &#8212;</textarea>
&nbsp;&nbsp;<input type="submit" value="publish new content"/>
</form>
</body></html>

and here is the result this all produces:
test: +entity: 235+ +entity: 151+verify with +entity:8212+

hex

74 65 73 74 3a 20 eb 20 97 76 65 72 69 66 79 20 77 69 74 68 20 2014
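
One way to read the hex dump above is to treat the two high bytes as
windows-1252 and see which Unicode positions they stand for. This is a sketch
under that assumption only (it does not prove what the browser actually sent);
it relies on the Encode module, and cp1252_to_codepoints is a hypothetical
helper name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: interpret a byte string as windows-1252 and
# return the Unicode position of each resulting character.
sub cp1252_to_codepoints {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);
    return map { sprintf('U+%04X', ord($_)) } split //, $text;
}

# The bytes eb, 20 and 97 from the dump above
print join(' ', cp1252_to_codepoints("\xeb\x20\x97")), "\n";
```

This prints U+00EB U+0020 U+2014, which are the e-umlaut (dec 235) and the em
dash (dec 8212) that the textarea started out with.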
