On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:
> So the UTF8 flag is enabled, and yet it has \303\204\302\215 in it. What is
> that crap?
That's octal notation, which I think Dump() uses for any byte greater than 127
and for control characters, so that it can output pure ASCII.
That sequence is only four bytes:
mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215";
Encode::_utf8_on($s); Dump $s'
SV = PV(0x801038) at 0x80e880
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
  CUR = 4  <--- four bytes
  LEN = 8
mar...@smokey:~ $
The logical content of the string follows in the second quote:
[UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
That's valid UTF-8.
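A quick way to check that claim without reading Dump() output is to decode the
four bytes strictly and list the code points that come out. This is only a
minimal sketch using core Encode; the variable names are illustrative and do
not come from the original code.

```perl
use strict;
use warnings;
use Encode ();

# The four bytes from the Dump() output, written as octal escapes.
my $bytes = "\303\204\302\215";

# Strict decode: FB_CROAK makes decode() die on malformed input, so
# getting past this call shows the bytes are well-formed UTF-8.
my $chars = Encode::decode('UTF-8', $bytes,
    Encode::FB_CROAK() | Encode::LEAVE_SRC());

my $byte_count  = length $bytes;    # length of the undecoded byte string
my $char_count  = length $chars;    # length in characters after decoding
my @code_points = map { sprintf('U+%04X', ord $_) } split //, $chars;

print "$byte_count bytes, $char_count characters: @code_points\n";
```

The four bytes decode to two code points, U+00C4 and U+008D, which matches the
\x{c4}\x{8d} that Dump() shows in its [UTF8 ...] section.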
> my $str = '<p>Tomas Laurinavičius</p>';
In source code, I try to stick to pure ASCII and use \x escapes -- like Dump()
does.
my $str = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>";
However, because those code points are both representable as Latin-1, Perl
will create a Latin-1 string. If you want to force its internal encoding to
UTF-8, you need to do additional work.
mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s;
utf8::upgrade($s); Dump $s'
SV = PV(0x801038) at 0x80e870
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x2012e0 "\304"\0
  CUR = 1
  LEN = 4
SV = PV(0x801038) at 0x80e870
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
  CUR = 2
  LEN = 3
mar...@smokey:~ $
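The same experiment can be scripted when Devel::Peek is not available. This is
a rough sketch using only the built-in utf8:: functions and the bytes pragma
(variable names are illustrative): upgrading changes the UTF8 flag and the
internal byte count, but not the string as Perl code sees it.

```perl
use strict;
use warnings;

my $latin1 = "\x{c4}";    # one code point below 256: stored as a single byte
my $copy   = $latin1;

utf8::upgrade($copy);     # switch the internal encoding to UTF-8, in place

my $flag_before = utf8::is_utf8($latin1) ? 1 : 0;
my $flag_after  = utf8::is_utf8($copy)   ? 1 : 0;

# Under "use bytes", length() reports the size of the internal buffer.
my $bytes_after = do { use bytes; length $copy };

printf "flag %d -> %d, length %d, internal bytes %d, equal: %s\n",
    $flag_before, $flag_after, length($copy), $bytes_after,
    ($latin1 eq $copy ? 'yes' : 'no');
```

The flag goes from 0 to 1 and the buffer grows from one byte to two (the CUR
values in the Dump() output), while length() still reports one character and
the two strings still compare equal.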
> Confused and frustrated,
IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand. Perl tries to hide
the details, but there are too many ways for it to fail silently. (perl -C,
$YAML::Syck::ImplicitUnicode, etc.)
Marvin Humphrey