A recent discussion on perl-unicode made it apparent that most folks working with Perl's Unicode model are unaware of these facts:

- Byte strings are upgraded to Unicode strings with "Latin-1".
- Unicode strings are downgraded to byte strings with "UTF-8".
- One can change the "Latin-1" part above with "use encoding".

The two patches below attempt to document these facts better, by adding
a Caveat item to perlunicode.pod and adding this information to the
encoding.pm module.

Also, perlunicode.pod used to say:

    If strings operating under byte semantics and strings with Unicode
    character data are concatenated, the new string will be upgraded to
    I<ISO 8859-1 (Latin-1)>

but this is wrong. The new string will be upgraded to Unicode -- it's
the old byte string that will be upgraded as Latin-1. The patch below
also addresses this.

Thanks,
/Autrijus/

--- perlunicode.pod.orig Tue Dec 9 19:50:32 2003
+++ perlunicode.pod Tue Dec 9 20:22:37 2003
@@ -42,6 +42,21 @@
 You can also use the C<encoding> pragma to change the default encoding
 of the data in your script; see L<encoding>.
 
+=item C<use encoding> needed to upgrade non-Latin-1 byte strings
+
+By default, there is a fundamental asymmetry in Perl's unicode model:
+implicit upgrading from byte strings to Unicode strings assumes that
+they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
+downgraded with UTF-8 encoding. This happens because the first 256
+codepoints in Unicode happen to agree with Latin-1.
+
+If you wish to interpret byte strings as UTF-8 instead, use the
+C<encoding> pragma:
+
+    use encoding 'utf8';
+
+See L</"Byte and Character Semantics"> for more details.
+
 =back
 
 =head2 Byte and Character Semantics
@@ -86,12 +101,12 @@
 be used to force byte semantics on Unicode data.
 
 If strings operating under byte semantics and strings with Unicode
-character data are concatenated, the new string will be upgraded to
-I<ISO 8859-1 (Latin-1)>, even if the old Unicode string used EBCDIC.
-This translation is done without regard to the system's native 8-bit
-encoding, so to change this for systems with non-Latin-1 and
-non-EBCDIC native encodings use the C<encoding> pragma. See
-L<encoding>.
+character data are concatenated, the new string will be created by
+decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
+old Unicode string used EBCDIC. This translation is done without
+regard to the system's native 8-bit encoding. To change this for
+systems with non-Latin-1 and non-EBCDIC native encodings, use the
+C<encoding> pragma. See L<encoding>.
 
 Under character semantics, many operations that formerly operated on
 bytes now operate on characters. A character in Perl is
--- encoding.pm.orig Tue Dec 9 19:50:37 2003
+++ encoding.pm Tue Dec 9 20:32:15 2003
@@ -192,6 +192,25 @@
 
 You can override this by giving extra arguments; see below.
 
+=head2 Implicit upgrading for byte strings
+
+By default, if strings operating under byte semantics and strings
+with Unicode character data are concatenated, the new string will
+be created by decoding the byte strings as I<ISO 8859-1 (Latin-1)>.
+
+The B<encoding> pragma changes this to use the specified encoding
+instead. For example:
+
+    use encoding 'utf8';
+    my $string = chr(20000); # a Unicode string
+    utf8::encode($string);   # now it's a UTF-8 encoded byte string
+    # concatenate with another Unicode string
+    print length($string . chr(20000));
+
+Will print C<2>, because C<$string> is upgraded as UTF-8. Without
+C<use encoding 'utf8';>, it will print C<4> instead, since C<$string>
+is three octets when interpreted as Latin-1.
+
 =head1 FEATURES THAT REQUIRE 5.8.1
 
 Some of the features offered by this pragma requires perl 5.8.1. Most
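
For anyone who wants to see the default Latin-1 upgrade without touching the pragma, here is a minimal sketch. It is not part of the patch: it assumes the core Encode module is available, the variable names are only illustrative, and it decodes the byte string explicitly with Encode::decode instead of relying on "use encoding".

```perl
use strict;
use warnings;
use Encode qw(decode);   # assumption: core Encode module (shipped since 5.8)

# chr(20000) is U+4E20, outside Latin-1, so this is a Unicode string.
my $char = chr(20000);

# Encode a copy to UTF-8: the copy is now a three-octet byte string.
my $bytes = $char;
utf8::encode($bytes);
my $octets = length($bytes);

# Implicit upgrade on concatenation: each octet is read as one
# Latin-1 character, so three octets become three characters.
my $implicit = $bytes . $char;

# Explicit decode first: the three octets become one character again.
my $explicit = decode('UTF-8', $bytes) . $char;

printf "octets: %d, implicit: %d, explicit: %d\n",
    $octets, length($implicit), length($explicit);
# prints "octets: 3, implicit: 4, explicit: 2"
```

The lengths 4 and 2 are the same numbers the encoding.pm example in the patch describes; the difference is only that the decoding step is spelled out by hand here.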