David E. Wheeler wrote on 16.06.2010 at 13:59 (-0700):
> On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:
> > On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:
> >> In order to print Unicode text strings (as opposed to octet
> >> strings) correctly to a terminal (UTF-8 or not), add the following
> >> line before the first output:
> >>
> >>   binmode STDOUT, ':utf8';
> >>
> >> But note that STDOUT is global.
> >
> > Yes, I do this all the time. Surprisingly, I don't get warnings for
> > this script, even though it is outputting multibyte characters.
>
> This is key. If I set the binmode on STDOUT to :utf8, the bogus
> characters print out bogus. If I set it to :raw, they come out right
> after processing by both Encode and XML::LibXML (I'm assuming they're
> interpreted as latin-1).

Yes, or as raw, which is equivalent. Any octet is valid Latin-1.

> So my question is this: Why isn't Encode dying when it runs into these
> characters? They're not valid utf-8, AFAICT. Are they somehow valid
> utf8 (that is, valid in Perl's internal format)?

Why would they be? Assuming we're talking about the same thing here:
They're not characters, they're octets. (The Perl documentation seems
to make an effort to conceptually distinguish between *octets* and
*bytes*, but they map to the same thing.)

I found it helpful to accept that the notion of a "UTF-8 character"
does not make sense: there are Unicode characters, but UTF-8 is an
encoding, and it deals with octets.

Here's your script with some modifications to illustrate how things
work:

          \,,,/
          (o o)
------oOOo-(_)-oOOo------
use strict;
use Encode;
use XML::LibXML;

# The script is written in UTF-8, but the utf8 pragma is not turned on.
# So the literals in our script yield octet strings, not text strings.
# (Note that it is probably much more convenient to go with the utf8
# pragma if you write your source code in UTF-8.)
my $octets = '<p>Tomas Laurinavičius</p>';
my $txt    = decode_utf8( $octets );
my $txt2   = "<p>Tomas Laurinavi\x{010d}ius</p>";

die if $txt2 ne $txt;     # they're equal
die if $txt2 eq $octets;  # they're not equal

# print raw UTF-8 octets; looks correct on UTF-8 terminal
print $octets, $/;

# print text containing wide character to narrow character filehandle
print "$txt WARN$/";      # triggers a warning: "Wide character in print"

binmode STDOUT, ':utf8';  # set to utf8, accepting wide characters
print $txt, $/;           # print text to terminal
print $octets, $/;        # double encoding, č as four bytes

my $parser = XML::LibXML->new;

# specify encoding for octet string
my $doc = $parser->parse_html_string($octets, {encoding => 'utf-8'});
print $doc->documentElement->toString, $/;

# no need to specify encoding for text string
my $doc2 = $parser->parse_html_string($txt);
print $doc2->documentElement->toString, $/;

-- 
Michael Ludwig
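
P.S. A minimal sketch of the octet/character distinction discussed
above; it is not part of the original script. It assumes the name is
given as explicit UTF-8 octets (so it does not depend on the encoding
of the source file), and it passes Encode's FB_CROAK and LEAVE_SRC
check flags so that malformed input makes decode() die rather than be
patched up with a substitution character. The variable names are made
up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets for "Laurinavičius": č (U+010D) is the two octets C4 8D.
my $octets = "Laurinavi\xC4\x8Dius";

# Strict decoding: octets in, characters out. FB_CROAK dies on malformed
# input; LEAVE_SRC keeps decode() from modifying $octets in place.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $text  = decode('UTF-8', $octets, $check);

printf "octets: %d, characters: %d\n", length($octets), length($text);
# prints "octets: 14, characters: 13"

# Double encoding: the octet string is treated as 14 Latin-1 characters
# and encoded again, so C4 and 8D each become two octets (č as four bytes).
my $double = encode('UTF-8', $octets);
printf "double-encoded octets: %d\n", length($double);
# prints "double-encoded octets: 16"

# A lone E8 followed by ASCII is valid Latin-1 but malformed UTF-8,
# so the strict decode dies and the eval returns false.
my $bad = "Laurinavi\xE8ius";
my $ok  = eval { decode('UTF-8', $bad, $check); 1 };
print $ok ? "decoded\n" : "died: malformed UTF-8\n";
# prints "died: malformed UTF-8"
```

Without a check argument, decode() does not die on such input, which
may be part of why no error showed up in the case discussed above.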