Because I had little need for it, I tried to ignore Perl's Unicode support for as long as possible. Now it looks like I can't do that anymore, so I have started reading through the various docs.
One thing confused me: several sources mention Perl using 8-bit characters as long as possible, which seems to contradict some of my observations. "perluniintro", for example, says:

    "if all code points in the string are 0xFF or less, Perl uses the
    native eight-bit character set. Otherwise, it uses UTF-8."

and the documentation for "Encode" says:

    · When you decode, the resulting UTF8 flag is on unless you can
      unambiguously represent data. Here is the definition of
      dis-ambiguity.

      After "$utf8 = decode('foo', $octet);",

        When $octet is...   The UTF8 flag in $utf8 is
        ---------------------------------------------
        In ASCII only (or EBCDIC only)            OFF
        In ISO-8859-1                              ON
        In any other Encoding                      ON
        ---------------------------------------------

But when I look at it with Devel::Peek, it seems that after "decoding"

- the UTF8 flag is always on
- only ASCII characters are stored as single bytes; everything else is converted to UTF-8

    > perl -MDevel::Peek -MEncode -e 'Dump(decode latin1 => "\x41")'
    SV = PV(0x603e58) at 0x62d620
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x6b3d90 "A"\0 [UTF8 "A"]

    > perl -MDevel::Peek -MEncode -e 'Dump("\xf6")'
    SV = PV(0x70bda8) at 0x606f10
      REFCNT = 1
      FLAGS = (PADTMP,POK,READONLY,pPOK)
      PV = 0x61ded0 "\366"\0

    > perl -MDevel::Peek -MEncode -e 'Dump(decode latin1 => "\xf6")'
    SV = PV(0x603e58) at 0x62d620
      REFCNT = 1
      FLAGS = (TEMP,POK,pPOK,UTF8)
      PV = 0x6b3d90 "\303\266"\0 [UTF8 "\x{f6}"]

So, which is true? Is the Unicode documentation obsolete and has the internal representation changed (I know, I should not worry about internals ;-), or is the output of Devel::Peek::Dump misleading?

Regards,
Peter
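P.S. For anyone who wants to reproduce this without reading Devel::Peek dumps, here is a minimal sketch that compares the literal "\xf6" with the decoded one at the character level. It relies only on utf8::is_utf8 and Encode's decode; the describe helper is a name I made up for illustration and is not part of any module. It is meant as a way to poke at the question, not as an answer to it.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (illustration only): report the internal UTF8 flag
# next to the character-level view of a string.
sub describe {
    my ($str) = @_;
    return {
        flag   => utf8::is_utf8($str) ? 1 : 0,        # internal storage flag
        length => length($str),                       # length in characters
        ords   => [ map { ord($_) } split //, $str ], # code point per character
    };
}

my $native  = "\xf6";                    # literal, as in the second Dump call
my $decoded = decode('latin1', "\xf6");  # as in the third Dump call

for my $pair ([native => $native], [decoded => $decoded]) {
    my ($name, $str) = @$pair;
    my $d = describe($str);
    printf "%-8s flag=%d length=%d ords=%s\n",
        $name, $d->{flag}, $d->{length},
        join(',', map { sprintf 'U+%04X', $_ } @{ $d->{ords} });
}

print "equal as strings: ", ($native eq $decoded ? "yes" : "no"), "\n";
```

If both strings report the same length and code point and compare equal while only the flag differs, that would suggest the flag describes how the string is stored rather than what it contains, but I would like someone who knows the internals to confirm that.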