Thanks very much, Josh, for investigating this; it saved me some time narrowing down the issue. Even so, I spent quite a lot of time working out a solution for my needs, and I still don't think it is generalizable as-is. However, in case someone else wants to give it a crack, I provide details below.

On 2012-06-05 19:30, Josh Chamas wrote:
doing this is where we have a problem:

<% print Encode::decode('ISO-8859-1',"\xE2"); %>

and immediately in the Apache::ASP::Response::Write() method the data has already been converted incorrectly

The fact that such a simple use of Encode causes an issue is a little surprising. Surely others are using Apache::ASP in multi-language environments - is no one using Encode this way? How are others coping with this limitation right now?

It's as if merely going through the tied interface puts the data through some conversion process.

Not quite, as the same results happen without a tie'd interface. The "use bytes" pragma is what causes the conversion (see test script below).

Apache::ASP::Response does a "use bytes", which is to deal with the output stream correctly; I believe this is around content length calculations. I think this is fine here, and turning this off makes things worse for these examples.

It looks like "use bytes" is now deprecated and should indeed be removed. The documentation doesn't mention any trivial substitute. However, this pragma mostly just overrides some built-in functions with byte-oriented versions. So I made the following changes to Response.pm:
- changed use bytes => no bytes (just import the namespace)
- changed all occurrences of length() => bytes::length()
This resolved the mixed-encoding issue originally posted, but introduced a new (more manageable) issue.
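
To illustrate the difference between the two length functions, here is a minimal sketch (not taken from Response.pm; the variable names are only for illustration). Note that bytes::length() reports the size of Perl's internal representation, which I am assuming is what the content-length code wants:

```perl
use strict;
use warnings;
no bytes;    # loads bytes.pm so bytes::length() is callable, without enabling byte semantics

# A one-character string (U+00E2), forced into Perl's internal UTF-8 representation
my $chars = "\x{e2}";
utf8::upgrade($chars);

my $char_count = length($chars);           # counts characters
my $byte_count = bytes::length($chars);    # counts bytes of the internal representation

print "characters: $char_count, internal bytes: $byte_count\n";
```

Under "use bytes", plain length() would return the byte count as well, which is why the explicit bytes::length() call is needed once the pragma is switched off.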

For debugging purposes, I peeked at the "UTF-8 flag" (Perl's internal flag that indicates that a string has a known decoding). This flag should be transparent in principle, but it helped make sense of the behaviour of Apache::ASP.
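
One way to peek at the flag is utf8::is_utf8(). The following is only a sketch of that kind of check, not the exact code I used; the labels "raw" and "decoded" are made up for the example:

```perl
use strict;
use warnings;
use Encode ();

my $raw     = "\xE2";                                  # plain byte string, UTF-8 flag off
my $decoded = Encode::decode('ISO-8859-1', "\xE2");    # decoded string, UTF-8 flag on

for my $pair ( [ raw => $raw ], [ decoded => $decoded ] ) {
    my ($name, $str) = @$pair;
    printf "%-8s flag=%-3s equal_to_raw=%s\n", $name,
        ( utf8::is_utf8($str) ? 'on' : 'off' ),
        ( $str eq $raw ? 'yes' : 'no' );
}
```

Both strings compare equal, which is the sense in which the flag is supposed to be transparent; only the internal representation differs.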
Results of testing are summarized as follows:

1. Testing Perl/CGI, asp-perl, and Apache::ASP, all 3 give the same results with the "use bytes" pragma turned on:
- For any string with the UTF-8 flag off, output is correctly encoded.
- Any string with the flag on is (double-)encoded as UTF-8, regardless of the actual output encoding.
2. Testing Perl/CGI and asp-perl with "no bytes" produces correct results:
- The UTF-8 flag does not affect output - it is correctly encoded in every case.
- However, an interesting test case is that of the double-encoding problem (see http://ahinea.com/en/tech/perl-unicode-struggle.html). This case is indicative of bad code, so is not a concern here, but it illustrates how a tie'd filehandle differs from plain STDOUT. In this case, a single "wide character" double-encodes the entire output (with buffering on, this can be the entire page), instead of just the string.
- These test cases are demonstrated by the script below.
3. Testing Apache::ASP with "no bytes" produces different results from the command-line (asp-perl) version, as well as different results from Perl/CGI running on Apache. This suggests an interaction effect between Apache and Apache::ASP (both are required to produce these results).
- With the UTF-8 flag off, output is correctly encoded as before.
- However, with "no bytes", Apache::ASP, and the UTF-8 flag on, the entire output is double-encoded. This result is similar to the double-encoding problem in the previous test case, except that it doesn't require a "wide character" - any string with the UTF-8 flag on will do.

This test script demonstrates all but the last test case:

#!/usr/bin/perl

use Encode;

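foreach ( "STDOUT", "tie_use_bytes", "tie_no_bytes" )
{
    my $tied = ! /^S/;    # "STDOUT" prints directly; otherwise FH is tied to the named class
    my $STDOUT;           # declared unconditionally ("my ... if" has undefined behaviour)
    print "$_: ";
    tie *FH, $_ if $tied;
    $STDOUT = select ( FH ) if $tied;
    print "\x{263a}",                           # wide character
        Encode::decode('ISO-8859-1',"\xE2"),    # UTF-8 flag on
        "\xE2";                                 # UTF-8 flag off
    print "\n";
    close ( FH ) if $tied;
    select ( $STDOUT ) if $tied;
}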

use strict;

package tie_use_bytes;
use bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

package tie_no_bytes;
no bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

# Output: ##################

Wide character in print at ...
STDOUT: ☺ââ # STDOUT output is correct in all cases
tie_use_bytes: ☺ââ # with "use bytes", the UTF-8-flagged 2nd character is double-encoded
Wide character in print at ...
tie_no_bytes: ☺ââ # with "no bytes", the output is correct, but a "wide character" double-encodes the entire string because of the way the tie'd file handle is implemented

#########################

By the way, if it's getting difficult to wrap your head around this, you're not alone.

At this point, I peeked at the $Response->{out} data buffer, and could see that it was encoded correctly. However, the output from Apache (when the UTF-8 flag is on) was not correct, suggesting that Apache is doing something to encode the string in this case. I decided therefore to address the problem by turning off the UTF-8 flag. The most fault-tolerant method I managed to come up with to do this was the following:

${$Response->{BinaryRef}}
    = Encode::encode ( 'ISO-8859-1', ${$Response->{BinaryRef}},
        sub{ Encode::encode ( 'UTF-8', chr ( shift() ) ) } )
    if ! grep ( /^utf8$/, PerlIO::get_layers ( STDOUT ) );

which can go at the top of the $Response->Flush() method, or in global.asa/Script_OnFlush().
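
The effect of that expression on a single string can be tried outside Apache::ASP. This is a standalone sketch only; to_latin1_bytes() is a hypothetical helper name, and no $Response object or STDOUT layer check is involved here:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, named for illustration only: encode to ISO-8859-1,
# falling back to UTF-8 bytes for any character outside that encoding.
sub to_latin1_bytes {
    my ($text) = @_;
    return Encode::encode( 'ISO-8859-1', $text,
        sub { Encode::encode( 'UTF-8', chr( shift() ) ) } );
}

my $flagged = Encode::decode( 'ISO-8859-1', "\xE2" );    # UTF-8 flag on
my $bytes   = to_latin1_bytes($flagged);

printf "flag before: %s, flag after: %s, bytes: %s\n",
    ( utf8::is_utf8($flagged) ? 'on' : 'off' ),
    ( utf8::is_utf8($bytes)   ? 'on' : 'off' ),
    join( ' ', map { sprintf '%02X', ord } split //, $bytes );
```

The result is a plain byte string with the flag off, which is the property the workaround relies on.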

With this solution I can now modify Apache::ASP's output encoding (e.g., using binmode ( STDOUT );), as originally desired, and the output appears correct in all my test cases.


--
-------------------------------------------------------------------------------
Arnon Weinberg
www.back2front.ca

