On Thu, Jun 17, 2010 at 10:17:52AM -0700, David E. Wheeler wrote:
> > The logical content of the string follows in the second quote:
> >
> >> [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
> >
> > That's valid UTF-8.
>
> In what sense? Legally perhaps, but I can make XML::LibXML choke on it.

There are two valid states for Perl scalars containing string data.

  * SVf_UTF8 flag off.
  * SVf_UTF8 flag on, and string data which is a valid UTF-8 byte sequence.

In both cases, we define the logical content of the string as a series of
Unicode code points.

If the UTF8 flag is off, then the scalar's data will be interpreted as
Latin-1. (Except under "use locale" but let's ignore that for now.) Each
byte will be interpreted as a single code point. The 256 logical code points
in Latin-1 are identical to the first 256 logical code points in Unicode.
This is by design -- the Unicode consortium chose to overlap with Latin-1
because it was so common. So any string content that consists solely of code
points 255 and under can be represented in Latin-1 without loss.

In a Perl scalar with the UTF8 flag on, you can get the code points by
decoding the variable width UTF-8 data, with each Unicode code point derived
by reading 1-4 bytes. *Any* sequence of Unicode code points can be
represented without loss.

Unfortunately, it is really, really easy to mess up string handling when
writing XS modules. A common error is to strip the UTF8 flag accidentally.
This changes the scalar's logical content, as now its string data will be
interpreted as Latin-1 rather than UTF-8.

A less common error is to turn on the UTF8 flag for a scalar which does not
contain a valid UTF-8 byte sequence. This puts the scalar into what I'm
calling an "invalid state". It will likely bring your program down with a
"panic" error message if you try to do something like run a regex on it.

In your case, the Dump of the scalar demonstrated that it had the UTF8 flag
set and that it contained a valid UTF-8 byte sequence -- a "valid state".
However, it looks like it had invalid content.

A scalar with the UTF8 flag off can never be in an "invalid state", because
any sequence of bytes is valid Latin-1. However, it's easy to change the
string's logical content by accidentally stripping or forgetting to set the
UTF8 flag.

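Here's a minimal sketch of the two valid states and of the flag-stripping
mistake, runnable from plain Perl. The variable names are mine, and
Encode::_utf8_off is an internal helper that I'm using only to imitate what a
buggy XS module does; it is an assumption for illustration, not something to
use in real code.

```perl
use strict;
use warnings;
use Encode ();

# Raw UTF-8 bytes for U+010D (c with caron): 0xC4 0x8D.
my $bytes = "\xc4\x8d";

# Valid state 1: UTF8 flag off. Each byte is read as one Latin-1 code point.
my $flag_off = $bytes;
printf "flag off: utf8=%d length=%d\n",
    (utf8::is_utf8($flag_off) ? 1 : 0), length $flag_off;
# prints: flag off: utf8=0 length=2

# Valid state 2: UTF8 flag on, valid UTF-8 data. One code point, U+010D.
my $flag_on = Encode::decode('UTF-8', $bytes);
printf "flag on:  utf8=%d length=%d ord=0x%04x\n",
    (utf8::is_utf8($flag_on) ? 1 : 0), length $flag_on, ord $flag_on;
# prints: flag on:  utf8=1 length=1 ord=0x010d

# The common mistake, simulated: strip the flag and leave the bytes alone.
my $stripped = $flag_on;
Encode::_utf8_off($stripped);
printf "stripped: utf8=%d length=%d\n",
    (utf8::is_utf8($stripped) ? 1 : 0), length $stripped;
# prints: stripped: utf8=0 length=2
```

The third case holds the same bytes as the second, but the logical content
has gone from one code point (U+010D) to two (U+00C4, U+008D).
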
Unfortunately, this error leads to silent failure -- no error message, but
the content changes -- and it can be really hard to debug.

This fellow's name, which you can see if you visit
<http://twitter.com/tomaslau>, contains Unicode code point 0x010d, "LATIN
SMALL LETTER C WITH CARON". As that code point is greater than 255, any Perl
string containing his name *must* have the UTF8 flag turned on.

I strongly suspect that at some point one of the following two things
happened:

  * The data was read from a UTF-8 source but the input filehandle was not
    set to UTF-8, as in:

        open(my $fh, '<:encoding(utf8)', $file) or die;

  * The flag got stripped and subsequently the UTF-8 data was incorrectly
    reinterpreted as Latin-1.

You typically need Devel::Peek for hunting down the second kind of error.

> > IMO, to get UTF-8 right consistently in a large Perl system, you need to
> > understand the internals and you need Devel::Peek at hand. Perl tries to
> > hide the details, but there are too many ways for it to fail silently.
> > ("perl -C", $YAML::Syck::ImplicitUnicode, etc.)
>
> Bleh. Such a PITA. I'd like not to have to think about this stuff, but I
> must because other people haven't.

It's more that getting UTF-8 support into Perl without breaking existing
programs was a truly awesome hack -- but that one of the limitations of that
hack was that the implementation is prone to silent failure.

> So here's my test:
>
>     use 5.12.0;
>     use Devel::Peek;
>
>     my $str = "<p>Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius</p>";
>     say $str;
>     utf8::upgrade($str);
>     binmode STDOUT, ':utf8';
>     say $str;
>     Dump $str;
>
> The output it still broken, however, in both cases, looking like this:
>
>     LaurinaviÄius
>     LaurinaviÃÂius

Let's double check something first. Based on your mail client (Apple Mail) I
see you're (still) using OS X. Check out Terminal -> Preferences -> Advanced
-> Character encoding. What's it set to? If it's not "Unicode (UTF-8)", set
it to that now.

Then try this:

    use 5.10.0;
    use Devel::Peek;

    my $str = "<p>Tomas Laurinavi\x{010d}ius</p>";
    say $str;
    binmode STDOUT, ':utf8';
    say $str;
    Dump $str;
    utf8::upgrade($str); # no effect
    Dump $str;

For me, that prints his name correctly twice. The first time, though, I get
a "wide character in print" warning. That warning arises because Perl's
STDOUT is set to Latin-1 by default. It wants to "downgrade" the UTF8 scalar
to Latin-1, but it can't do so without loss, so it warns and outputs the
bytes as is. After we change STDOUT to 'utf8', the warning goes away.

The utf8::upgrade() call has no effect, because the scalar starts off as
UTF8.

Prior to the introduction of the UTF8 flag, there was no way to put the code
point \x{010d} into a Perl string because Latin-1 can't represent it. For
backwards compatibility reasons, \x escapes of 255 and under have to be
represented as Latin-1. Since you asked for \x{010d}, though, Perl knows
that the backwards compat rules don't apply and it can use a UTF8 scalar.

HTH,

Marvin Humphrey
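
P.S. A small sketch of the \x escape behaviour described above, assuming a
stock perl 5.10 or later. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $latin1 = "\x{e9}";     # code point 233, fits in Latin-1
my $wide   = "\x{010d}";   # code point 269, does not fit

# Escapes of 255 and under stay in the flag-off representation;
# anything larger forces a UTF8 scalar.
printf "latin1: utf8=%d\n", (utf8::is_utf8($latin1) ? 1 : 0);
# prints: latin1: utf8=0
printf "wide:   utf8=%d\n", (utf8::is_utf8($wide) ? 1 : 0);
# prints: wide:   utf8=1

# utf8::upgrade changes the internal representation only;
# the logical content still compares equal.
my $upgraded = $latin1;
utf8::upgrade($upgraded);
printf "upgraded: utf8=%d same=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0), ($upgraded eq $latin1 ? 1 : 0);
# prints: upgraded: utf8=1 same=1
```

Calling utf8::upgrade on $wide would be a no-op, since it is already a UTF8
scalar, which is the case in the test script above.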