Thanks very much Josh for investigating this - it saved me some time
narrowing down the issue. Even so, I spent quite a lot of time working
out a solution for my needs, and I still don't think it is generalizable
as-is. However, in case someone else wants to give it a crack, I provide
details below.
On 2012-06-05 19:30, Josh Chamas wrote:
> doing this is where we have a problem:
> <% print Encode::decode('ISO-8859-1',"\xE2"); %>
> and immediately in the Apache::ASP::Response::Write() method the data
> has already been converted incorrectly
The fact that such a simple use of Encode causes an issue is a little
surprising. Surely others are using Apache::ASP in multi-language
environments - is no one using Encode this way? How are others coping
with this limitation right now?
> Its as if by merely going through the tied interface that data goes
> through some conversion process.
Not quite, as the same results happen without a tie'd interface. The
"use bytes" pragma is what causes the conversion (see test script below).
> Apache::ASP::Response does a "use bytes" which is to deal with the
> output stream correctly I believe this is around content length
> calculations.
> I think this is fine here, and turning this off makes things worse for
> these examples.
It looks like "use bytes" is now strongly discouraged (its documentation
recommends it only for debugging) and should indeed be removed. The
documentation doesn't mention any trivial substitute. However, this
pragma mostly just overrides some built-in functions with byte-oriented
versions. So I made the following changes to Response.pm:
- changed use bytes => no bytes (just import the namespace)
- changed all occurrences of length() => bytes::length()
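To illustrate why the second change matters, here is a minimal standalone
sketch (not code from Response.pm; the variable names are invented for
the example) comparing length() with bytes::length() on a decoded string:

```perl
use strict;
use warnings;
use Encode ();
no bytes;    # loads bytes.pm so bytes::length() exists, but keeps character semantics

# One character decoded from Latin-1, so the UTF-8 flag is on.
my $decoded = Encode::decode('ISO-8859-1', "\xE2");

my $chars = length($decoded);          # counts characters
my $bytes = bytes::length($decoded);   # counts bytes of Perl's internal representation

print "length: $chars, bytes::length: $bytes\n";
```

Note that bytes::length() measures Perl's internal storage, which need
not match the number of bytes actually sent if the output is encoded as
something other than UTF-8.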
This resolved the mixed-encoding issue originally posted, but introduced
a new (more manageable) issue.
For debugging purposes, I peeked at the "UTF-8 flag" (Perl's internal
flag that marks a string as stored internally in UTF-8, which is
normally the case once a string has been decoded). This flag should be
transparent in principle, but it helped make sense of the behaviour of
Apache::ASP.
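For anyone who wants to peek at the flag themselves, a minimal sketch
using the built-in utf8::is_utf8() follows (the variable names are just
for illustration). It shows two strings that compare equal but differ in
the flag:

```perl
use strict;
use warnings;
use Encode ();

my $raw     = "\xE2";                                 # byte string, flag off
my $decoded = Encode::decode('ISO-8859-1', "\xE2");   # character string, flag on

my $raw_flag     = utf8::is_utf8($raw)     ? 1 : 0;
my $decoded_flag = utf8::is_utf8($decoded) ? 1 : 0;
my $same         = $raw eq $decoded        ? 1 : 0;

print "raw flag: $raw_flag, decoded flag: $decoded_flag, equal: $same\n";
```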
Results of testing are summarized as follows:
1. Testing Perl/CGI, asp-perl, and Apache::ASP, all 3 give the same
   results with the "use bytes" pragma turned on:
   - For any string with the UTF-8 flag off, output is correctly encoded.
   - Any string with the flag on is (double-)encoded as UTF-8, regardless
     of the actual output encoding.
2. Testing Perl/CGI and asp-perl with "no bytes" produces correct results:
   - The UTF-8 flag does not affect output - it is correctly encoded in
     every case.
   - However, an interesting test case is that of the double-encoding
     problem (see http://ahinea.com/en/tech/perl-unicode-struggle.html).
     This case is indicative of bad code, so is not a concern here, but
     it illustrates how a tie'd filehandle differs from plain STDOUT. In
     this case, a single "wide character" double-encodes the entire
     output (with buffering on, this can be the entire page), instead of
     just the string.
   - These test cases are demonstrated by the script below.
3. Testing Apache::ASP with "no bytes" produces different results from
   the command-line (asp-perl) version, as well as different results from
   Perl/CGI running on Apache. This suggests an interaction effect between
   Apache and Apache::ASP (both are required to produce these results).
   - With the UTF-8 flag off, output is correctly encoded as before.
   - However, with "no bytes", Apache::ASP, and the UTF-8 flag on, the
     entire output is double-encoded. This result is similar to the
     double-encoding problem in the previous test case, except that it
     doesn't require a "wide character" - any string with the UTF-8 flag
     on will do.
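The "entire output" effect in the double-encoding case can be reproduced
without any filehandle at all. The sketch below is only an illustration
under my reading of the problem: a buffer that already holds encoded
bytes is upgraded as a whole when one wide character is appended, and
utf8::encode() stands in for what a handle without an encoding layer
would write.

```perl
use strict;
use warnings;

my $buffer = "\xC3\xA2";    # "a-circumflex" already encoded as UTF-8 bytes (flag off)
$buffer .= "\x{263a}";      # appending one wide character upgrades the whole buffer

my $written = $buffer;
utf8::encode($written);     # the bytes a handle with no encoding layer would emit

my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $written;
print "$hex\n";             # prints: C3 83 C2 A2 E2 98 BA
```

The first two bytes of the buffer come out as four (C3 83 C2 A2), i.e.
the part that was already correct is encoded a second time.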
This test script demonstrates all but the last test case:
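#!/usr/bin/perl
use strict;
use Encode;

# Print the same three strings to plain STDOUT and to two tied handles.
foreach my $target ( "STDOUT", "tie_use_bytes", "tie_no_bytes" )
{
    print "$target: ";
    my $tied = $target ne "STDOUT";
    my $stdout;
    if ( $tied )
    {
        tie *FH, $target;
        $stdout = select ( FH );
    }
    print "\x{263a}",                            # wide character (flag on)
          Encode::decode('ISO-8859-1',"\xE2"),   # decoded string (flag on)
          "\xE2";                                # raw byte (flag off)
    print "\n";
    if ( $tied )
    {
        close ( FH );          # CLOSE flushes the buffer to STDOUT
        select ( $stdout );
    }
}

# Both classes buffer everything printed and flush it on close;
# they differ only in the bytes pragma in effect.
package tie_use_bytes;
use bytes;
sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

package tie_no_bytes;
no bytes;
sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }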
# Output: ##################
Wide character in print at ...
STDOUT: ☺ââ         # STDOUT output is correct in all cases
tie_use_bytes: ☺ââ  # with "use bytes", the UTF-8-flagged 2nd character
                    # is double-encoded
Wide character in print at ...
tie_no_bytes: ☺ââ   # with "no bytes", the output is correct, but a
                    # "wide character" double-encodes the entire string
                    # because of the way the tie'd file handle is
                    # implemented
#########################
By the way, if it's getting difficult to wrap your head around this,
you're not alone.
At this point, I peeked at the $Response->{out} data buffer, and could
see that it was encoded correctly. However, the output from Apache (when
the UTF-8 flag is on) was not correct, suggesting that Apache is doing
something to encode the string in this case.
I decided therefore to address the problem by turning off the UTF-8
flag. The most fault-tolerant method I managed to come up with to do
this was the following:
    ${$Response->{BinaryRef}}
        = Encode::encode ( 'ISO-8859-1', ${$Response->{BinaryRef}},
            sub { Encode::encode ( 'UTF-8', chr ( shift() ) ) } )
        if ! grep ( /^utf8$/, PerlIO::get_layers ( STDOUT ) );
which can go at the top of the $Response->Flush() method, or in
global.asa/Script_OnFlush().
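The same idea can be tried outside Apache::ASP. The sketch below is a
standalone approximation, not the code used in Response.pm: the helper
name downgrade_to_latin1 is invented for the example, and the check on
STDOUT's layers is left out. Characters that fit in ISO-8859-1 become
single bytes, anything wider falls back to its UTF-8 bytes, and the
result no longer carries the UTF-8 flag.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (name invented for this example).
sub downgrade_to_latin1 {
    my ($text) = @_;
    return Encode::encode(
        'ISO-8859-1', $text,
        # Fallback for characters outside Latin-1: emit their UTF-8 bytes.
        sub { Encode::encode('UTF-8', chr(shift)) },
    );
}

# A wide character followed by a decoded Latin-1 character.
my $mixed = "\x{263a}" . Encode::decode('ISO-8859-1', "\xE2");
my $out   = downgrade_to_latin1($mixed);

my $flag = utf8::is_utf8($out) ? 1 : 0;
my $hex  = join ' ', map { sprintf '%02X', ord($_) } split //, $out;
print "flag: $flag, bytes: $hex\n";
```

Passing a code reference as the third argument is what makes this
fault-tolerant: without it, Encode would substitute a placeholder for
the characters it cannot represent in ISO-8859-1.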
With this solution I can now modify Apache::ASP's output encoding (e.g.,
using binmode ( STDOUT );), as originally desired, and the output
appears correct in all my test cases.
--
-------------------------------------------------------------------------------
Arnon Weinberg
www.back2front.ca