Marvin,
I can always count on you for a detailed explanation. Thanks. You ought to turn
this into a blog post!
On Jun 17, 2010, at 4:06 PM, Marvin Humphrey wrote:
> There are two valid states for Perl scalars containing string data.
>
> * SVf_UTF8 flag off.
> * SVf_UTF8 flag on, and string data which is a valid UTF-8 byte sequence.
>
> In both cases, we define the logical content of the string as a series of
> Unicode code points.
>
> If the UTF8 flag is off, then the scalar's data will be interpreted as
> Latin-1. (Except under "use locale" but let's ignore that for now.) Each
> byte will be interpreted as a single code point. The 256 logical code points
> in Latin-1 are identical to the first 256 logical code points in Unicode.
> This is by design -- the Unicode consortium chose to overlap with Latin-1
> because it was so common. So any string content that consists solely of code
> points 255 and under can be represented in Latin-1 without loss.
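A quick sketch of the flag-off case (`utf8::is_utf8` reports the flag's state; 0xE9 is "LATIN SMALL LETTER E WITH ACUTE"):

```perl
use strict;
use warnings;

# UTF8 flag off: every byte is one code point, interpreted as Latin-1.
my $latin1 = "caf\xe9";

print length($latin1), "\n";                        # 4 code points, 4 bytes
print ord(substr($latin1, 3)), "\n";                # 233 (0xE9)
print utf8::is_utf8($latin1) ? "on\n" : "off\n";    # off
```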
Hrm. So am I safe in changing the CP1252 gremlin bytes to proper UTF-8
characters in Encode::ZapCP1252 like so?
    $_[0] =~ s{([\x80-\x9f])}{
        $table->{$1} ? Encode::decode('UTF-8', $table->{$1}) : $1
    }emxsg;
Where `$table` is the lookup table mapping byte values like \x80 to their UTF-8
equivalents (€)? This assumes that $_[0] has the UTF8 flag on, of course.
So is this safe? Are \x80-\x9f considered characters when the utf8 flag is on,
or are they bytes that might break multibyte characters that use those bytes?
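A sketch of the distinction, assuming the string has been properly decoded first. The UTF-8 encoding of U+010D is the byte pair 0xC4 0x8D, and 0x8D falls inside [\x80-\x9f] as a raw byte:

```perl
use strict;
use warnings;
use Encode qw(decode);

# As raw bytes, 0x8D matches the character class; on a decoded string
# (UTF8 flag on) the regex engine sees whole code points, so the class
# cannot match inside a multibyte character.
my $bytes = "Laurinavi\xc4\x8dius";       # undecoded octets
my $chars = decode('UTF-8', $bytes);      # code points, flag on

print $bytes =~ /[\x80-\x9f]/ ? "bytes: match\n" : "bytes: no match\n";
print $chars =~ /[\x80-\x9f]/ ? "chars: match\n" : "chars: no match\n";
```

So on a decoded string, any \x80-\x9f that *do* match really are stand-alone C1 code points, i.e. CP1252 gremlins that came in via a Latin-1 decode.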
> In a Perl scalar with the UTF8 flag on, you can get the code points by
> decoding the variable width UTF-8 data, with each code point derived by
> reading 1-4 bytes. *Any* sequence of Unicode code points can be represented
> without loss.
Right.
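For the record, a round trip through the 0x010d case:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+010D round-trips through its two-byte UTF-8 encoding, 0xC4 0x8D.
my $char  = "\x{010d}";
my $bytes = encode('UTF-8', $char);

printf "%d code point, %d bytes\n", length($char), length($bytes);   # 1, 2
printf "%02x %02x\n", map { ord } split //, $bytes;                  # c4 8d
print decode('UTF-8', $bytes) eq $char ? "round trip ok\n" : "broken\n";
```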
> Unfortunately, it is really, really easy to mess up string handling when
> writing XS modules. A common error is to strip the UTF8 flag accidentally.
> This changes the scalar's logical content, as now its string data will be
> interpreted as Latin-1 rather than UTF-8.
>
> A less common error is to turn on the UTF8 flag for a scalar which does not
> contain a valid UTF-8 byte sequence. This puts the scalar into what I'm
> calling an "invalid state". It will likely bring your program down with a
> "panic" error message if you try to do something like run a regex on it.
Fortunately, I'm not writing XS modules. :-)
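Still, the invalid state can be demonstrated from pure Perl via `Encode::_utf8_on`, which is documented as dangerous for exactly this reason (a demo sketch, not something for real code):

```perl
use strict;
use warnings;
use Encode ();

# Flip the UTF8 flag on over bytes that are NOT valid UTF-8:
# a lone lead byte with no continuation byte after it.
my $str = "\xc4";
Encode::_utf8_on($str);

print utf8::is_utf8($str) ? "flag on\n" : "flag off\n";   # flag on
print utf8::valid($str)   ? "valid\n"   : "INVALID\n";    # INVALID
```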
> In your case, the Dump of the scalar demonstrated that it had the UTF8 flag
> set and that it contained a valid UTF-8 byte sequence -- a "valid state".
> However, it looks like it had invalid content.
Yes. I broke it with zap_cp1252 (applied before decoding). I just removed that
and things became valid again. The character was still broken, as it is in the
feed, but at least it was valid -- and the same as the source.
> A scalar with the UTF8 flag off can never be in an "invalid state", because
> any sequence of bytes is valid Latin-1. However, it's easy to change the
> string's logical content by accidentally stripping or forgetting to set the
> UTF8 flag. Unfortunately, this error leads to silent failure -- no error
> message, but the content changes -- and it can be really hard to debug.
Yes, this is what happened to me by zapping the non-utf8 scalar with zap_cp1252
before decoding it. Bad idea.
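The silent-failure mode is easy to reproduce deliberately (a sketch using `Encode::_utf8_off` to simulate the accidental strip):

```perl
use strict;
use warnings;
use Encode ();

# No warning, no error -- the same internal bytes are simply
# reinterpreted as Latin-1 code points, one per byte.
my $str = "Laurinavi\x{010d}ius";
print length($str), "\n";        # 13 code points

Encode::_utf8_off($str);         # simulate the accidental strip
print length($str), "\n";        # 14 -- \x{010d} is now \xc4 \x8d
```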
> This fellow's name, which you can see if you visit
> <http://twitter.com/tomaslau>, contains Unicode code point 0x010d, "LATIN
> SMALL LETTER C WITH CARON". As that code point is greater than 255, any
> Perl string containing his name *must* have the UTF8 flag turned on.
>
> I strongly suspect that at some point one of the following two things
> happened:
>
> * The content was input from a UTF-8 source but the input filehandle was not
> set to UTF-8 -- open(my $fh, '<:encoding(UTF-8)', $file) or die;
Well, I was pulling it from HTTP::Response->content. I'm not using
HTTP::Response->decoded_content because it's XML, which should be handled as
bytes (see http://juerd.nl/site.plp/perluniadvice).
> * The flag got stripped and subsequently the UTF-8 data was incorrectly
> reinterpreted as Latin-1.
> You typically need Devel::Peek for hunting down the second kind of error.
I missed that one, fortunately.
> It's more that getting UTF-8 support into Perl without breaking existing
> programs was a truly awesome hack -- but that one of the limitations of that
> hack was that the implementation is prone to silent failure.
Right. It's an impressive achievement. And I can't wait until DBI 2 is built on
Rakudo. ;-)
>> The output is still broken, however, in both cases, looking like this:
>>
>> Laurinavičius
>> LaurinaviÄius
>
> Let's double check something first. Based on your mail client (Apple Mail) I
> see you're (still) using OS X. Check out Terminal -> Preferences -> Advanced
> -> Character encoding. What's it set to? If it's not "Unicode (UTF-8)", set
> it to that now.
I always use UTF-8. Snow Leopard actually seems to allow multiple encodings
(!), as the "Encoding" tab (no more advanced tab) has UTF-8, Mac OS Roman,
Latin-1, and Latin-9 (wha?) checked, as well as a bunch of other encodings.
> Then try this:
>
> use 5.10.0;
> use Devel::Peek;
>
> my $str = "<p>Tomas Laurinavi\x{010d}ius</p>";
> say $str;
>
> binmode STDOUT, ':utf8';
> say $str;
>
> Dump $str;
> utf8::upgrade($str); # no effect
> Dump $str;
>
> For me, that prints his name correctly twice. The first time, though, I get
> a "wide character in print" warning. That warning arises because Perl's
> STDOUT is set to Latin-1 by default. It wants to "downgrade" the UTF8 scalar
> to Latin-1, but it can't do so without loss, so it warns and outputs the bytes
> as is. After we change STDOUT to 'utf8', the warning goes away.
Yep, same here.
> The utf8::upgrade() call has no effect, because the scalar starts off as
> UTF8. Prior to the introduction of the UTF8 flag, there was no way to put
> the code point \x{010d} into a Perl string because Latin-1 can't represent it.
> For backwards compatibility reasons, \x escapes of 255 and under have to be
> represented as Latin-1. Since you asked for \x{010d}, though, Perl knows
> that the backwards compat rules don't apply and it can use a UTF8 scalar.
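A sketch confirming that backwards-compat rule for \x escapes:

```perl
use strict;
use warnings;

# Escapes of 255 and under stay flag-off (Latin-1) for backwards
# compatibility; \x{010d} cannot be stored that way, so Perl flips
# the UTF8 flag on.
my $low  = "\xe9";       # fits in Latin-1
my $high = "\x{010d}";   # does not

print utf8::is_utf8($low)  ? "low: on\n"  : "low: off\n";    # low: off
print utf8::is_utf8($high) ? "high: on\n" : "high: off\n";   # high: on
```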
Ah, I see. That's probably what happened inside Google Pipes: their code read
the original feed into a Latin-1 variable somehow, and the \x{010d} got changed
to \x{c4}\x{8d}, and it wasn't converted back before being output as UTF-8.
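That theory is easy to check; a sketch of the double-encode that produces this kind of mojibake:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode once to get the correct bytes, then encode those bytes again
# as though they were Latin-1 characters -- the classic double-encode.
my $name  = "Laurinavi\x{010d}ius";
my $once  = encode('UTF-8', $name);    # ... \xc4 \x8d ...
my $twice = encode('UTF-8', $once);    # ... \xc3 \x84 \xc2 \x8d ...
my $seen  = decode('UTF-8', $twice);   # what a UTF-8 consumer displays

binmode STDOUT, ':utf8';
print "$seen\n";    # "LaurinaviÄ\x{8d}ius" -- U+00C4 plus invisible U+008D
```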
Best,
David