Re: Output character encoding

2012-06-14 Thread Arnon Weinberg


Thanks very much Josh for investigating this - it saved me some time 
narrowing down the issue. Even still, I did spend quite a lot of time 
working out a solution for my needs, and still I don't think it is 
generalizable as-is. However, in case someone else wants to give it a 
crack, I provide details below.


On 2012-06-05 19:30, Josh Chamas wrote:

> doing this is where we have a problem:
>
> <% print Encode::decode('ISO-8859-1',"\xE2"); %>
>
> and immediately in the Apache::ASP::Response::Write() method the data
> has already been converted incorrectly


The fact that such a simple use of Encode causes an issue is a little 
surprising. Surely others are using Apache::ASP in multi-language 
environments - is no one using Encode this way? How are others coping 
with this limitation right now?


> It's as if by merely going through the tied interface that data goes
> through some conversion process.


Not quite, as the same results happen without a tie'd interface. The 
use bytes pragma is what causes the conversion (see test script below).


> Apache::ASP::Response does a "use bytes", which is to deal with the
> output stream correctly; I believe this is around content length
> calculations.
> I think this is fine here, and turning this off makes things worse for
> these examples.


It looks like "use bytes" is now strongly discouraged (its own
documentation says so for anything other than debugging) and should
indeed be removed. The documentation doesn't mention any trivial
substitute. However, this pragma mostly just overrides some built-in
functions with byte-oriented versions. So I made the following changes
to Response.pm:

- changed "use bytes" to "no bytes" (this still loads the bytes::
  namespace)
- changed all occurrences of length() to bytes::length()

This resolved the mixed-encoding issue originally posted, but introduced
a new (more manageable) issue.
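
For anyone unfamiliar with the distinction, here is a minimal sketch (not taken from Response.pm; the variable names are only illustrative) of why length() and bytes::length() can disagree once a string carries the UTF-8 flag:

```perl
use strict;
use warnings;
use Encode ();
no bytes;    # loads bytes.pm (so bytes::length exists) without byte semantics

# One character (U+00E2); after decoding, Perl stores it internally as 2 bytes.
my $decoded = Encode::decode('ISO-8859-1', "\xE2");

my $char_count = length($decoded);           # counts characters
my $byte_count = bytes::length($decoded);    # counts bytes of the internal form

print "characters: $char_count, internal bytes: $byte_count\n";
```

Note that bytes::length() reports the size of Perl's internal representation, so it matches the number of bytes sent only if the string is actually written out as UTF-8.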


For debugging purposes, I peeked at the UTF-8 flag (Perl's internal 
flag that indicates that a string has a known decoding). This flag 
should be transparent in principle, but it helped make sense of the 
behaviour of Apache::ASP.
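
A minimal sketch of how the flag can be inspected, assuming nothing beyond core Perl and Encode (the variable names are illustrative, and this is not the exact debugging code used):

```perl
use strict;
use warnings;
use Encode ();

my $raw     = "\xE2";                              # byte string, flag off
my $decoded = Encode::decode('ISO-8859-1', $raw);  # character string, flag on

for my $pair ( [ raw => $raw ], [ decoded => $decoded ] ) {
    my ($name, $string) = @$pair;
    printf "%-8s UTF-8 flag %s\n", $name,
        utf8::is_utf8($string) ? "on" : "off";
}

# Both hold the same single character as far as string comparison goes,
# which is why the flag is supposed to be transparent.
my $same = ( $raw eq $decoded ) ? 1 : 0;
print "equal as strings: $same\n";
```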

Results of testing are summarized as follows:

1. Testing Perl/CGI, asp-perl, and Apache::ASP, all 3 give the same 
results with the use bytes pragma turned on:

- For any string with the UTF-8 flag off, output is correctly encoded.
- Any string with the flag on is (double-)encoded as UTF-8, regardless 
of the actual output encoding.

2. Testing Perl/CGI and asp-perl with no bytes produces correct results:
- The UTF-8 flag does not affect output - it is correctly encoded in 
every case.
- However, an interesting test case is that of the double-encoding 
problem (see http://ahinea.com/en/tech/perl-unicode-struggle.html). This 
case is indicative of bad code, so is not a concern here, but it 
illustrates how a tie'd filehandle differs from plain STDOUT. In this 
case, a single wide character double-encodes the entire output (with 
buffering on, this can be the entire page), instead of just the string.

- These test cases are demonstrated by the script below.

3. Testing Apache::ASP with no bytes produces different results from 
the command-line (asp-perl) version, as well as different results from 
Perl/CGI running on Apache. This suggests an interaction effect between 
Apache and Apache::ASP (both are required to produce these results).

- With the UTF-8 flag off, output is correctly encoded as before.
- However, with no bytes, Apache::ASP, and the UTF-8 flag on, the 
entire output is double-encoded. This result is similar to the 
double-encoding problem in the previous test case, except that it 
doesn't require a wide character - any string with the UTF-8 flag on 
will do.
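
The double-encoding mechanism itself can be shown in isolation. This is only a sketch with made-up variable names, not Apache::ASP's code: octets that were already encoded as UTF-8 get appended to a buffer together with a character string, and the whole buffer is then encoded again on the way out.

```perl
use strict;
use warnings;
use Encode ();

# U+00E2 already encoded as UTF-8 octets (C3 A2), UTF-8 flag off.
my $encoded_once = Encode::encode('UTF-8', "\x{E2}");

# Appending a wide character upgrades the whole buffer; the two octets
# are now treated as the two Latin-1 characters U+00C3 and U+00A2.
my $buffer = $encoded_once . "\x{263a}";

# Encoding the buffer for output encodes those two characters again.
my $encoded_twice = Encode::encode('UTF-8', $buffer);

my $hex = join ' ', map { sprintf '%02X', ord } split //, $encoded_twice;
print "$hex\n";    # C3 83 C2 A2 E2 98 BA
```

The first four octets are the double-encoded form of what should have been C3 A2; only the trailing E2 98 BA (the wide character) is encoded once.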


This test script demonstrates all but the last test case:

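#!/usr/bin/perl

use Encode;

# Run the same print through plain STDOUT and through two tied handles.
foreach ( "STDOUT", "tie_use_bytes", "tie_no_bytes" )
{
    my $STDOUT;                       # previously selected handle, if any
    print "$_: ";
    tie *FH, $_ if ! /^S/;
    $STDOUT = select ( FH ) if ! /^S/;
    print "\x{263a}",                             # wide character
          Encode::decode('ISO-8859-1',"\xE2"),    # UTF-8 flag on
          "\xE2";                                 # UTF-8 flag off
    print "\n";
    close ( FH ) if ! /^S/;           # CLOSE flushes the buffer to STDOUT
    select ( $STDOUT ) if ! /^S/;
}

use strict;

# Buffering handle whose methods are compiled under "use bytes".
package tie_use_bytes;
use bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }

# Identical handle, but compiled under "no bytes".
package tie_no_bytes;
no bytes;

sub TIEHANDLE { bless {}, shift; }
sub PRINT { shift()->{out} .= join ( $,, @_ ); }
sub CLOSE { print STDOUT delete ( shift()->{out} ); }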

# Output: ##

Wide character in print at ...
STDOUT: ☺ââ # STDOUT output is correct in all cases
tie_use_bytes: ☺ââ # with use bytes, the UTF-8-flagged 2nd character
                   # is double-encoded

Wide character in print at ...
tie_no_bytes: ☺ââ # with no bytes, the output is correct, but a
                  # wide character double-encodes the entire string
                  # because of the way the tie'd file handle is
                  # implemented


#
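
Since the way these characters render depends on the terminal's own encoding, a hex dump of the octets is less ambiguous. The sketch below is an illustration only: bytes_as_printed and hex_dump are hypothetical helpers that imitate what print does on a handle with no encoding layer, and the "joined" case corresponds to the tie'd handle without use bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: the octets print() would emit on a handle with no
# encoding layer. Perl downgrades the string to Latin-1 when it can, and
# otherwise writes its internal UTF-8 form (with a "Wide character" warning).
sub bytes_as_printed {
    my ($string) = @_;
    utf8::encode($string) unless utf8::downgrade($string, 1);
    return $string;
}

# Hypothetical helper: space-separated hex of each octet.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

my @items = ( "\x{263a}", Encode::decode('ISO-8859-1', "\xE2"), "\xE2" );

# Plain STDOUT prints each list item on its own.
my $separately = join ' ', map { hex_dump( bytes_as_printed($_) ) } @items;

# The tied handles join the items into one buffer first (shown here
# without "use bytes"), so the wide character affects the whole buffer.
my $joined = hex_dump( bytes_as_printed( join '', @items ) );

print "item by item: $separately\n";
print "joined first: $joined\n";
```

In the item-by-item case the last two characters come out as the single octet E2 each, which is what a Latin-1 page expects; in the joined case both become C3 A2.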

By the way, if it's getting difficult to wrap your head around this, 
you're not alone.


At this point, I peeked at the $Response->{out} data buffer, and could 
see that it was encoded correctly. However, the output from Apache (when 
the UTF-8 flag is on) 

Re: Output character encoding

2012-06-05 Thread Arnon Weinberg


On 2012-06-05 05:55, Warren Young wrote:
> There are several places where you set this, not just one, and they
> all have to agree to guarantee correct output:
>
> DB -> back end -> Apache -> HTML -> Apache::ASP -> browser
>
> If they do not all agree, you can either get mixed encodings or
> encoding ping-ponging.
>
> So, you have to check all the links in that chain:


With my test cases (provided) I have carefully narrowed down the 
inconsistency to Apache::ASP, since everything else is either not 
applicable or the same.


> - Apache has things like the AddDefaultCharset directive which play
> into this.


No, it doesn't, since I'm not testing the browser.  For the record 
though, when I use GET -e, I see the correct header in both tests: 
Content-Type: text/html; charset=ISO-8859-1


> - For the Perl aspects, I recommend just reading the Perl manual
> chapter on it: perldoc perlunicode.  Perl's Unicode support is deep,
> broad, and continually evolving[*].  You really must read your
> particular version's docs to know exactly how it's going to behave.
> There have been several breaking changes over the past decade or so.


Perl is behaving as documented.  Apache::ASP is giving me trouble.

> - There are at least three ways to set the character encoding in your
> HTML.  RTFEE: https://en.wikipedia.org/wiki/Character_encodings_in_HTML
>
> - And finally, it's possible to set a browser to ignore whatever it's
> told by the HTTP server and the document, and force it to interpret
> the data using some other character set.


That's all true, but none of it matters since with a mixed encoding 
output, there is no character set encoding that I can use on the browser 
to show a correct decoding.





>> Regular perl/CGI output defaults to ISO-8859-1 encoding,
>
> Really?  I'd expect it to take the overall Perl default, which is
> UTF-8 on most Unix type systems with Perl 5.6 onward on OSes
> contemporary with that version of Perl.  I would have expected that
> you'd have to go out of your way to force a return to Latin-1.


Yes, this is right out of the manual (open):
"... the default layer for the operating system (:raw on Unix, :crlf on 
Windows) is used."

The :utf8 output layer encoding must be explicitly set, as it is not the 
default.  However, I have not figured out how to do this successfully 
within Apache::ASP.
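
For reference, this is roughly what setting the layer explicitly looks like in plain Perl. It is only a sketch using in-memory filehandles as stand-ins for real output streams (the variable names are illustrative), and it says nothing about how to get the same effect inside Apache::ASP:

```perl
use strict;
use warnings;

my $text = "\xE2";    # the character U+00E2

# In-memory filehandles stand in for real output streams here.
open( my $default_fh, '>', \my $default_out )
    or die "cannot open in-memory handle: $!";
open( my $utf8_fh, '>:encoding(UTF-8)', \my $utf8_out )
    or die "cannot open in-memory handle: $!";

print {$default_fh} $text;    # no encoding layer: written as the octet E2
print {$utf8_fh}    $text;    # explicit layer: written as the octets C3 A2

close($default_fh);
close($utf8_fh);

printf "default layer: %d octet(s), UTF-8 layer: %d octet(s)\n",
    length($default_out), length($utf8_out);
```

For a handle that is already open, such as STDOUT in a CGI script, the usual equivalent is binmode(STDOUT, ':encoding(UTF-8)').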


> It's 2012.  Please, please, please abandon Latin-1.  Everything speaks
> UTF-8 these days, at the borders at least, even systems like Windows
> and JavaScript where it isn't the native character set.  It is safe to
> consider UTF-8 the standard Unicode encoding online.


This is part of an exercise to do just that.  At the moment, we have 
many lines of legacy code still using Latin-1, and are converting them 
step-wise to use UTF-8.  As the test cases show however, they do not 
play well together on Apache::ASP (though they are fine everywhere 
else).  If anyone has any suggestions on how this can be resolved so 
that we can continue the conversion, that would be much appreciated.



--
---
Arnon Weinberg
www.back2front.ca


-
To unsubscribe, e-mail: asp-unsubscr...@perl.apache.org
For additional commands, e-mail: asp-h...@perl.apache.org



Re: Output character encoding

2012-06-05 Thread Thanos Chatziathanassiou

> With my test cases (provided) I have carefully narrowed down the
> inconsistency to Apache::ASP, since everything else is either not
> applicable or the same.
 

Could you be a bit more specific on this ?

I've built many a site in international character sets and using
Apache::ASP for well over a decade, so I can tell you that it works
just fine with UTF-8 (and ISO-8859-[157] if that matters).
Last problem was back in 2004 when Content-Length was incorrectly
calculated.

> No, it doesn't, since I'm not testing the browser.  For the record
> though, when I use GET -e, I see the correct header in both tests:
> Content-Type: text/html; charset=ISO-8859-1

That's as simple as
``$Response->{ContentType} = "text/html; charset=UTF-8";''
It doesn't tell us anything about the actual encoding of the content.
Bear in mind that your selected encoding might be insufficient to
display the text you're feeding it.
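
As a rough illustration of that last point, the sketch below checks whether a given character can be represented in a given encoding. fits_in is a hypothetical helper written for this example, not part of Encode or Apache::ASP:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns 1 if every character of $text can be
# represented in $encoding, 0 otherwise.
sub fits_in {
    my ($encoding, $text) = @_;
    my $copy = $text;    # FB_CROAK may modify its argument, so use a copy
    my $ok = eval { Encode::encode($encoding, $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

for my $case ( [ 'ISO-8859-1', "\x{E2}" ],
               [ 'ISO-8859-1', "\x{263a}" ],
               [ 'UTF-8',      "\x{263a}" ] ) {
    my ($encoding, $text) = @$case;
    printf "U+%04X in %-10s: %s\n", ord($text), $encoding,
        fits_in($encoding, $text) ? "ok" : "not representable";
}
```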

> Yes, this is right out of the manual (open):
> ... the default layer for the operating system (:raw on Unix, :crlf on
> Windows) is used.
> The :utf8 output layer encoding must be explicitly set, as it is not the
> default.  However, I have not figured out how to do this successfully
> within Apache::ASP.

How does file handling come into play here ? Not that it's relevant but
it works quite the same way as outside of Apache::ASP.

 
> This is part of an exercise to do just that.  At the moment, we have
> many lines of legacy code still using Latin-1, and are converting them
> step-wise to use UTF-8.  As the test cases show however, they do not
> play well together on Apache::ASP (though they are fine everywhere
> else).  If anyone has any suggestions on how this can be resolved so
> that we can continue the conversion, that would be much appreciated.
 
 

Have a look at Text::Iconv, iconv(1), iconv(3) and friends. Also, Encode.
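
With Encode, for instance, converting a legacy Latin-1 byte string to UTF-8 can be as small as the sketch below (the sample data and variable names are made up for illustration):

```perl
use strict;
use warnings;
use Encode ();

# Legacy Latin-1 octets for U+00E2, e.g. as read from an old data source.
my $octets = "\xE2";

# from_to() converts the octets in place and returns the new length in
# octets (or undef on failure).
my $new_length = Encode::from_to($octets, 'ISO-8859-1', 'UTF-8');

printf "now %d octets: %s\n", $new_length,
    join ' ', map { sprintf '%02X', ord } split //, $octets;
```

The result is still a byte string (UTF-8 flag off), so it has to be treated consistently as encoded output rather than mixed with decoded character strings.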

Best Regards,
Thanos Chatziathanassiou
