Re: Variation In Decoding Between Encode and XML::LibXML

Michael Ludwig Sat, 19 Jun 2010 04:02:52 -0700

David E. Wheeler wrote on 16.06.2010 at 13:59 (-0700):
> On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:
> > On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:


> >> In order to print Unicode text strings (as opposed to octet
> >> strings) correctly to a terminal (UTF-8 or not), add the following
> >> line before the first output:
> >> 
> >> binmode STDOUT, ':utf8';
> >> 
> >> But note that STDOUT is global.
> > 
> > Yes, I do this all the time. Surprisingly, I don't get warnings for
> > this script, even though it is outputting multibyte characters.
> 
> This is key. If I set the binmode on STDOUT to :utf8, the bogus
> characters print out bogus. If I set it to :raw, they come out right
> after processing by both Encode and XML::LibXML (I'm assuming they're
> interpreted as latin-1).

Yes, or as raw, which is equivalent. Any octet is valid Latin-1.
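
To make the Latin-1 point concrete, here is a minimal sketch (the variable names are only illustrative) that decodes all 256 possible octet values as ISO-8859-1 and checks that octet N always becomes code point N, which is why that decoding step cannot fail:

```perl
use strict;
use warnings;
use Encode qw(decode);

# One string holding every octet value 0x00..0xFF.
my $all_octets = join '', map { chr($_) } 0 .. 255;

# ISO-8859-1 maps octet N to code point U+00NN, so no octet is invalid.
my $text = decode('ISO-8859-1', $all_octets);

# Count positions where the code point differs from the octet value.
my $mismatches = grep { ord(substr($text, $_, 1)) != $_ } 0 .. 255;

printf "%d characters, %d mismatches\n", length($text), $mismatches;
```

Nothing is printed as text here, only counts, so the output does not depend on the terminal encoding.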

> So my question is this: Why isn't Encode dying when it runs into these
> characters? They're not valid utf-8, AFAICT. Are they somehow valid
> utf8 (that is, valid in Perl's internal format)? Why would they be?

Assuming we're talking about the same thing here: They're not
characters, they're octets. (The Perl documentation seems to make
an effort to conceptually distinguish between *octets* and *bytes*,
but they map to the same thing.) I found it helpful to accept that
the notion of a "UTF-8 character" does not make sense: there are
Unicode characters, but UTF-8 is an encoding, and it deals with
octets.
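
A small sketch of that distinction, assuming the stock Encode module: the same "č" is two octets before decoding and one character after. The second half shows what I believe is the reason decoding does not die on malformed input: with the default CHECK value, decode() substitutes U+FFFD, and it only dies when asked to with FB_CROAK. The sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# "č" (U+010D) as the two octets of its UTF-8 encoding.
my $octets = "\xC4\x8D";
my $text   = decode('UTF-8', $octets);

print length($octets), " octets, ", length($text), " character\n";
printf "code point U+%04X\n", ord($text);

# Encoding goes the other way: one character back to two octets.
my $again = encode('UTF-8', $text);
print "round trip ok\n" if $again eq $octets;

# 0xFF never occurs in valid UTF-8. By default decode() replaces it
# with U+FFFD instead of dying.
my $broken  = "a\xFFb";
my $lenient = decode('UTF-8', $broken);
print "lenient decode substituted U+FFFD\n"
    if index($lenient, "\x{FFFD}") >= 0;

# FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
# $broken from being modified in place.
my $strict = eval { decode('UTF-8', $broken, FB_CROAK | LEAVE_SRC) };
print "strict decode died\n" unless defined $strict;
```

So if you want the script to stop on invalid input, passing FB_CROAK as the CHECK argument is one way to get that.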

Here's your script with some modifications to illustrate how things
work:

          \,,,/
          (o o)
------oOOo-(_)-oOOo------
use strict;
use warnings;
use Encode;
use XML::LibXML;
# The script is written in UTF-8, but the utf8 pragma is not turned on.
# So the literals in our script yield octet strings, not text strings.
# (Note that it is probably much more convenient to go with the utf8
# pragma if you write your source code in UTF-8.)
my $octets = '<p>Tomas Laurinavičius</p>';
my $txt    = decode_utf8( $octets );
my $txt2   = "<p>Tomas Laurinavi\x{010d}ius</p>";

die if $txt2 ne $txt;    # they're equal
die if $txt2 eq $octets; # they're not equal

# print raw UTF-8 octets; looks correct on UTF-8 terminal
print $octets, $/;
# print text containing wide character to narrow character filehandle
print "$txt WARN$/"; # triggers a warning: "Wide character in print"
binmode STDOUT, ':utf8'; # set to utf8, accepting wide characters
print $txt, $/; # print text to terminal
print $octets, $/; # double encoding, č as four bytes

my $parser = XML::LibXML->new;
# specify encoding for octet string
my $doc = $parser->parse_html_string($octets, {encoding => 'utf-8'});
print $doc->documentElement->toString, $/;
# no need to specify encoding for text string
my $doc2 = $parser->parse_html_string($txt);
print $doc2->documentElement->toString, $/;
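
The double encoding mentioned in the script can also be checked without a terminal. This is only a sketch: it uses encode() to stand in for what the :utf8 layer does to an octet string, namely treating each octet as a Latin-1 character and encoding it again.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The two UTF-8 octets of "č", kept as an octet string (not decoded).
my $octets = "\xC4\x8D";

# Each octet is taken as a character (U+00C4, U+008D) and encoded
# to UTF-8 a second time.
my $double = encode('UTF-8', $octets);

printf "%d octets: %s\n", length($double),
    join(' ', map { sprintf('%02X', ord($_)) } split(//, $double));
```

This prints "4 octets: C3 84 C2 8D", which is the four-byte form of č you see when the octet string goes through a :utf8 filehandle.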
-- 
Michael Ludwig
