On Jun 16, 2010, at 6:03 PM, Marvin Humphrey wrote:

> On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:
>
>> So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What
>> is that crap?
>
> That's octal notation, which I think Dump() uses for any byte greater than 127
> and for control characters, so that it can output pure ASCII.

Okay.

> That sequence is only four bytes:
>
> mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215";
> Encode::_utf8_on($s); Dump $s'
> SV = PV(0x801038) at 0x80e880
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
>   CUR = 4 <----------------------------------------------- four bytes
>   LEN = 8
> mar...@smokey:~ $
>
> The logical content of the string follows in the second quote:
>
>> [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
>
> That's valid UTF-8.

In what sense? Legally perhaps, but I can make XML::LibXML choke on it.

>> my $str = '<p>Tomas Laurinavi????ius</p>';
>
> In source code, I try to stick to pure ASCII and use \x escapes -- like Dump()
> does.
>
> my $str = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"

Okay, that makes it easier to test things (I've been pulling stuff out of the broken feed I downloaded).

> However, because those code points are both representable as Latin-1, Perl
> will create a Latin-1 string. If you want to force its internal encoding to
> UTF-8, you need to do additional work.
>
> mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s;
> utf8::upgrade($s); Dump $s'
> SV = PV(0x801038) at 0x80e870
>   REFCNT = 1
>   FLAGS = (POK,pPOK)
>   PV = 0x2012e0 "\304"\0
>   CUR = 1
>   LEN = 4
> SV = PV(0x801038) at 0x80e870
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
>   CUR = 2
>   LEN = 3
> mar...@smokey:~ $
>
>> Confused and frustrated,
>
> IMO, to get UTF-8 right consistently in a large Perl system, you need to
> understand the internals and you need Devel::Peek at hand. Perl tries to hide
> the details, but there are too many ways for it to fail silently. ("perl -C",
> $YAML::Syck::ImplicitUnicode, etc.)

Bleh. Such a PITA. I'd like not to have to think about this stuff, but I must because other people haven't.

So here's my test:

    use 5.12.0;
    use Devel::Peek;

    my $str = "<p>Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius</p>";
    say $str;
    utf8::upgrade($str);
    binmode STDOUT, ':utf8';
    say $str;
    Dump $str;

The output is still broken, however, in both cases, looking like this:

    LaurinaviÄius
    LaurinaviÃÂius
    SV = PV(0x100801c78) at 0x10082ac40
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x100202170 "Laurinavi\303\203\302\204\303\202\302\215ius"\0 [UTF8 "Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius"]
      CUR = 20
      LEN = 32

So it may be valid UTF-8, but why does it come out looking like crap? That is, "LaurinaviÃÂius"? I suppose there's an argument that "LaurinaviÄius" is correct and valid, if ugly. Maybe?

Thanks,

David
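
One possible reading of the output above, offered as a sketch rather than a diagnosis: the feed text may have been UTF-8 encoded more than once, so that each byte of an earlier encoding became a character of its own. Under that assumption, the test string (built from the bytes Dump() printed, used as code points) carries one more layer than the feed string. The helper `undo_utf8_layer` below is hypothetical and not from the thread; it removes one layer by treating each character as a byte and decoding those bytes as UTF-8. The `<p>` wrapper is left out for brevity.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): treat every character of
# $chars as one byte, then decode those bytes as UTF-8 once.
sub undo_utf8_layer {
    my ($chars) = @_;
    my $bytes = $chars;        # work on a copy
    utf8::downgrade($bytes);   # dies if any character is above 0xFF
    return decode('UTF-8', $bytes);
}

# The test string from the message: four characters copied from the
# byte view that Dump() printed.
my $test = "Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius";

# One layer off gives the string Dump() showed for the feed:
# the two characters U+00C4 and U+008D.
my $feed = undo_utf8_layer($test);

# A second layer off gives a single character, U+010D.
my $name = undo_utf8_layer($feed);

binmode STDOUT, ':encoding(UTF-8)';
printf "%d chars, U+%04X\n", length($name), ord(substr($name, 9, 1));
print "$name\n";
```

If the assumption holds, this prints `13 chars, U+010D` followed by `Laurinavičius`. It would also account for the two `say` lines: without an output layer the four characters go out as the bytes C3 84 C2 8D, which a UTF-8 terminal shows as "Ä" plus an invisible control character, and with `:utf8` each of the four characters is encoded again, which is where "ÃÂ" comes from.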