Re: Output character encoding

Arnon Weinberg Thu, 14 Jun 2012 21:34:58 -0700

Thanks very much, Josh, for investigating this - it saved me some time narrowing down the issue. Even so, I spent quite a lot of time working out a solution for my needs, and I still don't think it is generalizable as-is. However, in case someone else wants to give it a crack, I provide details below.


On 2012-06-05 19:30, Josh Chamas wrote:

> doing this is where we have a problem:
>
> <% print Encode::decode('ISO-8859-1',"\xE2"); %>
>
> and immediately in the Apache::ASP::Response::Write() method the data has already been converted incorrectly

The fact that such a simple use of Encode causes an issue is a little surprising. Surely others are using Apache::ASP in multi-language environments - is no one using Encode this way? How are others coping with this limitation right now?

> It's as if by merely going through the tied interface that data goes through some conversion process.

Not quite, as the same results happen without a tie'd interface. The "use bytes" pragma is what causes the conversion (see test script below).

> Apache::ASP::Response does a "use bytes" which is to deal with the output stream correctly; I believe this is around content length calculations. I think this is fine here, and turning this off makes things worse for these examples.

It looks like "use bytes" is now deprecated and should indeed be removed. The documentation doesn't mention any trivial substitute. However, this pragma mostly just overrides some built-in functions with byte-oriented versions. So I made the following changes to Response.pm:

- changed use bytes => no bytes (just import the namespace)
- changed all occurrences of length() => bytes::length()

This resolved the mixed-encoding issue originally posted, but introduced a new (more manageable) issue.
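To make the difference between the two length functions concrete, here is a minimal sketch (not taken from Response.pm; the variable names are illustrative only) of what length() and bytes::length() report for the same string once "no bytes" has loaded the module:

```perl
use strict;
use warnings;
use Encode ();
no bytes;    # loads bytes.pm without enabling byte semantics

# One character, U+00E2, with the UTF-8 flag on (held internally as 2 bytes).
my $decoded = Encode::decode('ISO-8859-1', "\xE2");

my $char_count = length($decoded);          # character semantics
my $byte_count = bytes::length($decoded);   # size of the internal representation

print "length: $char_count, bytes::length: $byte_count\n";
# prints "length: 1, bytes::length: 2"
```

Note that bytes::length() measures Perl's internal representation, so it only matches the number of bytes sent when the output encoding is UTF-8 or the string has the flag off.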

For debugging purposes, I peeked at the "UTF-8 flag" (Perl's internal flag that indicates that a string has a known decoding). This flag should be transparent in principle, but it helped make sense of the behaviour of Apache::ASP.
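For anyone wanting to do the same kind of peeking, a minimal sketch using the built-in utf8::is_utf8() on the three strings used in the test script further down (the labels are mine, for illustration):

```perl
use strict;
use warnings;
use Encode ();

# The same three strings the test script prints.
my @samples = (
    [ 'wide character'  => "\x{263a}" ],
    [ 'decoded Latin-1' => Encode::decode('ISO-8859-1', "\xE2") ],
    [ 'plain byte'      => "\xE2" ],
);

my %flag;
for my $sample (@samples) {
    my ($label, $string) = @$sample;
    # utf8::is_utf8() reports the internal flag; it says nothing about validity.
    $flag{$label} = utf8::is_utf8($string) ? 'on' : 'off';
    print "$label: UTF-8 flag $flag{$label}\n";
}
```

The first two strings report the flag on and the last reports it off, which is why the second and third characters can behave differently even though both are "â".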

Results of testing are summarized as follows:

1. Testing Perl/CGI, asp-perl, and Apache::ASP, all 3 give the same results with the "use bytes" pragma turned on:

- For any string with the UTF-8 flag off, output is correctly encoded.

- Any string with the flag on is (double-)encoded as UTF-8, regardless of the actual output encoding.

2. Testing Perl/CGI and asp-perl with "no bytes" produces correct results:

- The UTF-8 flag does not affect output - it is correctly encoded in every case.

- However, an interesting test case is that of the double-encoding problem (see http://ahinea.com/en/tech/perl-unicode-struggle.html). This case is indicative of bad code, so is not a concern here, but it illustrates how a tie'd filehandle differs from plain STDOUT. In this case, a single "wide character" double-encodes the entire output (with buffering on, this can be the entire page), instead of just the string.

- These test cases are demonstrated by the script below.

3. Testing Apache::ASP with "no bytes" produces different results from the command-line (asp-perl) version, as well as different results from Perl/CGI running on Apache. This suggests an interaction effect between Apache and Apache::ASP (both are required to produce these results).

- With the UTF-8 flag off, output is correctly encoded as before.

- However, with "no bytes", Apache::ASP, and the UTF-8 flag on, the entire output is double-encoded. This result is similar to the double-encoding problem in the previous test case, except that it doesn't require a "wide character" - any string with the UTF-8 flag on will do.
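In case "double-encoded" is unclear, here is a small sketch of what it means at the byte level. It only illustrates the end result, not the code path inside Apache or Apache::ASP that produces it:

```perl
use strict;
use warnings;
use Encode ();

my $char = "\x{E2}";                          # the character "â"
my $once = Encode::encode('UTF-8', $char);    # correct UTF-8 encoding

# Mistaking those bytes for Latin-1 characters and encoding them again:
my $twice = Encode::encode('UTF-8', Encode::decode('ISO-8859-1', $once));

my $hex_once  = join ' ', map { sprintf '%02X', ord } split //, $once;
my $hex_twice = join ' ', map { sprintf '%02X', ord } split //, $twice;

print "encoded once:  $hex_once\n";     # C3 A2
print "encoded twice: $hex_twice\n";    # C3 83 C2 A2
```

Read back as UTF-8, the second byte sequence displays as "Ã¢" rather than "â".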


This test script demonstrates all but the last test case:

#!/usr/bin/perl

use Encode;

foreach ( "STDOUT", "tie_use_bytes", "tie_no_bytes" )
{
    print "$_: ";
    tie *FH, $_ if ! /^S/;
    # Declared separately: "my ... if ..." has undefined behaviour in Perl.
    my $STDOUT;
    $STDOUT = select ( FH ) if ! /^S/;
    print "\x{263a}",                            # wide character
        Encode::decode('ISO-8859-1',"\xE2"),     # UTF-8 flag on
        "\xE2";                                  # UTF-8 flag off
    print "\n";
    close ( FH ) if ! /^S/;
    select ( $STDOUT ) if ! /^S/;
}

use strict;

# Buffers output like Apache::ASP::Response does, with byte semantics.
package tie_use_bytes;
use bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

# Identical buffer, but with character semantics.
package tie_no_bytes;
no bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

# Output: ##################

Wide character in print at ...
STDOUT: ☺ââ # STDOUT output is correct in all cases

tie_use_bytes: ☺Ã¢â # with "use bytes", the UTF-8-flagged 2nd character is double-encoded

Wide character in print at ...

tie_no_bytes: ☺Ã¢Ã¢ # with "no bytes", the output is correct, but a "wide character" double-encodes the entire string because of the way the tie'd file handle is implemented


#########################

By the way, if it's getting difficult to wrap your head around this, you're not alone.

At this point, I peeked at the $Response->{out} data buffer, and could see that it was encoded correctly. However, the output from Apache (when the UTF-8 flag is on) was not correct, suggesting that Apache is doing something to encode the string in this case.

I decided therefore to address the problem by turning off the UTF-8 flag. The most fault-tolerant method I managed to come up with to do this was the following:


${$Response->{BinaryRef}}
    = Encode::encode ( 'ISO-8859-1', ${$Response->{BinaryRef}},
        sub{ Encode::encode ( 'UTF-8', chr ( shift() ) ) } )
    if ! grep ( /^utf8$/, PerlIO::get_layers ( STDOUT ) );

which can go at the top of the $Response->Flush() method, or in global.asa/Script_OnFlush().
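To try the encoding step outside Apache::ASP, the following standalone sketch applies the same Encode::encode() call with a fallback coderef to a mixed string. The helper name flatten_to_bytes is hypothetical, and the PerlIO layer check is left out for brevity:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Apache::ASP): returns a byte string
# with the UTF-8 flag off.
sub flatten_to_bytes {
    my ($string) = @_;
    return Encode::encode(
        'ISO-8859-1', $string,
        # Called with the code point of each character Latin-1 cannot hold;
        # such characters are emitted as UTF-8 bytes instead.
        sub { Encode::encode('UTF-8', chr(shift)) },
    );
}

my $mixed = "\x{263a}" . "\xE2";    # wide character followed by "â"
my $bytes = flatten_to_bytes($mixed);

my $flag = utf8::is_utf8($bytes) ? 'on' : 'off';
my $hex  = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "UTF-8 flag: $flag, bytes: $hex\n";
```

Characters that fit in ISO-8859-1 become single bytes, anything wider falls back to its UTF-8 byte sequence, and the result should come back with the flag off, which is the property the fix relies on.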

With this solution I can now modify Apache::ASP's output encoding (e.g., using binmode ( STDOUT );), as originally desired, and the output appears correct in all my test cases.



--
-------------------------------------------------------------------------------
Arnon Weinberg
www.back2front.ca

