Re: Output character encoding
Thanks very much Josh for investigating this - it saved me some time
narrowing down the issue. Even so, I spent quite a lot of time working out a
solution for my needs, and I still don't think it is generalizable as-is.
However, in case someone else wants to give it a crack, I provide details
below.

On 2012-06-05 19:30, Josh Chamas wrote:
> doing this is where we have a problem:
>
>   <% print Encode::decode('ISO-8859-1', "\xE2"); %>
>
> and immediately in the Apache::ASP::Response::Write() method the data has
> already been converted incorrectly

The fact that such a simple use of Encode causes an issue is a little
surprising. Surely others are using Apache::ASP in multi-language
environments - is no one using Encode this way? How are others coping with
this limitation right now?

> It's as if by merely going through the tied interface that data goes
> through some conversion process.

Not quite, as the same results happen without a tie'd interface. The "use
bytes" pragma is what causes the conversion (see test script below).

> Apache::ASP::Response does a "use bytes" which is to deal with the output
> stream correctly. I believe this is around content length calculations. I
> think this is fine here, and turning this off makes things worse for
> these examples.

It looks like "use bytes" is now deprecated and should indeed be removed.
The documentation doesn't mention any trivial substitute. However, this
pragma mostly just overrides some built-in functions with byte-oriented
versions. So I made the following changes to Response.pm:

- changed "use bytes" to "no bytes" (just import the namespace)
- changed all occurrences of length() to bytes::length()

This resolved the mixed-encoding issue originally posted, but introduced a
new (more manageable) issue. For debugging purposes, I peeked at the UTF-8
flag (Perl's internal flag that indicates that a string has a known
decoding). This flag should be transparent in principle, but it helped make
sense of the behaviour of Apache::ASP.
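As a minimal sketch of why that substitution works (plain Perl, run outside
Apache::ASP): length() counts characters on a UTF-8-flagged string, while
bytes::length() counts the bytes of its internal representation - which is
what a Content-Length calculation needs.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;
use bytes ();   # load bytes.pm without enabling the pragma

# One a-circumflex as a raw Latin-1 byte (UTF-8 flag off):
my $raw = "\xE2";
# The same character decoded to Perl's internal form (UTF-8 flag on):
my $dec = Encode::decode( 'ISO-8859-1', "\xE2" );

# length() counts characters; bytes::length() counts internal bytes.
printf "raw: length=%d bytes::length=%d\n",
    length($raw), bytes::length($raw);   # raw: length=1 bytes::length=1
printf "dec: length=%d bytes::length=%d\n",
    length($dec), bytes::length($dec);   # dec: length=1 bytes::length=2
```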
Results of testing are summarized as follows:

1. Testing Perl/CGI, asp-perl, and Apache::ASP, all 3 give the same results
   with the "use bytes" pragma turned on:
   - For any string with the UTF-8 flag off, output is correctly encoded.
   - Any string with the flag on is (double-)encoded as UTF-8, regardless
     of the actual output encoding.

2. Testing Perl/CGI and asp-perl with "no bytes" produces correct results:
   - The UTF-8 flag does not affect output - it is correctly encoded in
     every case.
   - However, an interesting test case is that of the double-encoding
     problem (see http://ahinea.com/en/tech/perl-unicode-struggle.html).
     This case is indicative of bad code, so is not a concern here, but it
     illustrates how a tie'd filehandle differs from plain STDOUT. In this
     case, a single wide character double-encodes the entire output (with
     buffering on, this can be the entire page), instead of just the
     string.
   - These test cases are demonstrated by the script below.

3. Testing Apache::ASP with "no bytes" produces different results from the
   command-line (asp-perl) version, as well as different results from
   Perl/CGI running on Apache. This suggests an interaction effect between
   Apache and Apache::ASP (both are required to produce these results).
   - With the UTF-8 flag off, output is correctly encoded as before.
   - However, with "no bytes", Apache::ASP, and the UTF-8 flag on, the
     entire output is double-encoded. This result is similar to the
     double-encoding problem in the previous test case, except that it
     doesn't require a wide character - any string with the UTF-8 flag on
     will do.

This test script demonstrates all but the last test case:

  #!/usr/bin/perl
  use Encode;

  foreach ( 'STDOUT', 'tie_use_bytes', 'tie_no_bytes' ) {
    print "$_: ";
    tie *FH, $_ if ! /^S/;
    my $STDOUT = select ( FH ) if ! /^S/;
    print "\x{263a}", Encode::decode('ISO-8859-1', "\xE2"), "\xE2";
    print "\n";
    close ( FH ) if ! /^S/;
    select ( $STDOUT ) if ! /^S/;
  }

  use strict;

  package tie_use_bytes;
  use bytes;
  sub TIEHANDLE { bless {}, shift; }
  sub PRINT { shift()->{out} .= join ( $,, @_ ); }
  sub CLOSE { print STDOUT delete ( shift()->{out} ); }

  package tie_no_bytes;
  no bytes;
  sub TIEHANDLE { bless {}, shift; }
  sub PRINT { shift()->{out} .= join ( $,, @_ ); }
  sub CLOSE { print STDOUT delete ( shift()->{out} ); }

  # Output:
  #
  #   Wide character in print at ...
  #   STDOUT: ☺ââ
  #     (STDOUT output is correct in all cases)
  #   tie_use_bytes: ☺ââ
  #     (with "use bytes", the UTF-8-flagged 2nd character is
  #      double-encoded)
  #   Wide character in print at ...
  #   tie_no_bytes: ☺ââ
  #     (with "no bytes", the output is correct, but a wide character
  #      double-encodes the entire string because of the way the tie'd
  #      file handle is implemented)

By the way, if it's getting difficult to wrap your head around this, you're
not alone.

At this point, I peeked at the $Response->{out} data buffer, and could see
that it was encoded correctly. However, the output from Apache (when the
UTF-8 flag is on) was double-encoded.
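A note for anyone reproducing these tests: the UTF-8 flag referred to
throughout can be inspected with Encode::is_utf8. A minimal sketch (plain
Perl; as perlunicode notes, the flag is an internal detail and is used here
only as a debugging aid):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

my $raw  = "\xE2";                                 # byte literal: flag off
my $dec  = Encode::decode( 'ISO-8859-1', "\xE2" ); # decoded string: flag on
my $wide = "\x{263a}";                             # wide character: flag on

for ( [ raw => $raw ], [ dec => $dec ], [ wide => $wide ] ) {
    printf "%-4s UTF-8 flag: %s\n",
        $_->[0], Encode::is_utf8( $_->[1] ) ? 'on' : 'off';
}
# raw  UTF-8 flag: off
# dec  UTF-8 flag: on
# wide UTF-8 flag: on
```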
Re: Output character encoding
On 2012-06-05 05:55, Warren Young wrote:
> There are several places where you set this, not just one, and they all
> have to agree to guarantee correct output:
>
>   DB - back end - Apache - HTML - Apache::ASP - browser
>
> If they do not all agree, you can either get mixed encodings or encoding
> ping-ponging. So, you have to check all the links in that chain:

With my test cases (provided) I have carefully narrowed down the
inconsistency to Apache::ASP, since everything else is either not
applicable or the same.

> - Apache has things like the AddDefaultCharset directive which play into
>   this.

No, it doesn't, since I'm not testing the browser. For the record though,
when I use "GET -e", I see the correct header in both tests:

  Content-Type: text/html; charset=ISO-8859-1

> - For the Perl aspects, I recommend just reading the Perl manual chapter
>   on it: perldoc perlunicode. Perl's Unicode support is deep, broad, and
>   continually evolving[*]. You really must read your particular version's
>   docs to know exactly how it's going to behave. There have been several
>   breaking changes over the past decade or so.

Perl is behaving as documented. Apache::ASP is giving me trouble.

> - There are at least three ways to set the character encoding in your
>   HTML. RTFEE: https://en.wikipedia.org/wiki/Character_encodings_in_HTML
>
> - And finally, it's possible to set a browser to ignore whatever it's
>   told by the HTTP server and the document, and force it to interpret the
>   data using some other character set.

That's all true, but none of it matters: with mixed-encoding output, there
is no character set I can select in the browser that produces a correct
decoding.

>> Regular perl/CGI output defaults to ISO-8859-1 encoding,
>
> Really? I'd expect it to take the overall Perl default, which is UTF-8 on
> most Unix type systems with Perl 5.6 onward on OSes contemporary with
> that version of Perl. I would have expected that you'd have to go out of
> your way to force a return to Latin-1.
Yes, this is right out of the manual (open):

  ... the default layer for the operating system (:raw on Unix, :crlf on
  Windows) is used.

The :utf8 output layer encoding must be explicitly set, as it is not the
default. However, I have not figured out how to do this successfully within
Apache::ASP.

> It's 2012. Please, please, please abandon Latin-1. Everything speaks
> UTF-8 these days, at the borders at least, even systems like Windows and
> JavaScript where it isn't the native character set. It is safe to
> consider UTF-8 the standard Unicode encoding online.

This is part of an exercise to do just that. At the moment, we have many
lines of legacy code still using Latin-1, and are converting them step-wise
to use UTF-8. As the test cases show however, they do not play well
together on Apache::ASP (though they are fine everywhere else). If anyone
has any suggestions on how this can be resolved so that we can continue the
conversion, that would be much appreciated.

--
Arnon Weinberg
www.back2front.ca

---------------------------------------------------------------------
To unsubscribe, e-mail: asp-unsubscr...@perl.apache.org
For additional commands, e-mail: asp-h...@perl.apache.org
Re: Output character encoding
> With my test cases (provided) I have carefully narrowed down the
> inconsistency to Apache::ASP, since everything else is either not
> applicable or the same.

Could you be a bit more specific on this? I've built many a site in
international character sets using Apache::ASP for well over a decade, so I
can tell you that it works just fine with UTF-8 (and ISO-8859-[157] if that
matters). The last problem was back in 2004, when Content-Length was
incorrectly calculated.

> No, it doesn't, since I'm not testing the browser. For the record though,
> when I use "GET -e", I see the correct header in both tests:
>
>   Content-Type: text/html; charset=ISO-8859-1

That's as simple as:

  $Response->{ContentType} = 'text/html; charset=UTF-8';

It doesn't tell us anything about the actual encoding of the content. Bear
in mind that your selected encoding might be insufficient to display the
text you're feeding it.

> Yes, this is right out of the manual (open):
>
>   ... the default layer for the operating system (:raw on Unix, :crlf on
>   Windows) is used.
>
> The :utf8 output layer encoding must be explicitly set, as it is not the
> default. However, I have not figured out how to do this successfully
> within Apache::ASP.

How does file handling come into play here? Not that it's relevant, but it
works quite the same way as outside of Apache::ASP.

> This is part of an exercise to do just that. At the moment, we have many
> lines of legacy code still using Latin-1, and are converting them
> step-wise to use UTF-8. As the test cases show however, they do not play
> well together on Apache::ASP (though they are fine everywhere else). If
> anyone has any suggestions on how this can be resolved so that we can
> continue the conversion, that would be much appreciated.

Have a look at Text::Iconv, iconv(1), iconv(3) and friends. Also, Encode.

Best Regards,
Thanos Chatziathanassiou
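To illustrate the Encode-based route suggested above, a minimal sketch of a
step-wise Latin-1 to UTF-8 conversion in plain Perl (the final binmode line
shows the explicit output-layer setup discussed earlier; doing the
equivalent inside Apache::ASP remains the open question of this thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode;

# Legacy data: "café" as Latin-1 bytes.
my $latin1 = "caf\xE9";

# Two-step: decode the bytes to characters, then encode to the target set.
my $chars = Encode::decode( 'ISO-8859-1', $latin1 );
my $utf8  = Encode::encode( 'UTF-8', $chars );
printf "latin1=%d bytes, utf8=%d bytes\n",
    length($latin1), length($utf8);   # latin1=4 bytes, utf8=5 bytes

# One-step, in-place alternative:
my $copy = $latin1;
Encode::from_to( $copy, 'ISO-8859-1', 'UTF-8' );

# Outside Apache::ASP, an explicit :encoding layer makes print encode
# character strings automatically (and silences "Wide character" warnings):
binmode( STDOUT, ':encoding(UTF-8)' );
print $chars, "\n";
```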