RE: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Henning Michael Møller Just
Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW)

Which editor do you use? When loading the script in Komodo IDE 5.2 the string 
looks broken. Running the script (ActivePerl 5.10.1 on Windows) only the second 
line is correct - the first (no surprise) and third are broken.

Loading the file in UltraEdit-32 13.20+3, set to not convert the script on 
loading, it becomes obvious that what should have been one character is 
represented by 4 bytes, \xC3 \x84 \xC2 \x8D, which modern editors would 
probably show as 2 characters and as broken.

It looks to me like the string is being displayed as a byte representation of 
the characters, if that makes sense. My English isn't perfect :-/ and what I am 
trying to say is that this is a problem that I am quite familiar with. It happens 
whenever the source and the reader do not agree on whether a string is encoded 
in utf-8 or not.

Apparently Encode fixes the incorrect string which is nice. The interesting 
thing is, where should this be fixed? If it's at Yahoo! Pipes you'll probably 
have to use Encode as a work-around for some time...


Best regards
Henning Michael Møller Just




-Original Message-
From: David E. Wheeler [mailto:da...@kineticode.com] 
Sent: Wednesday, June 16, 2010 7:56 AM
To: perl-unicode@perl.org
Subject: Variation In Decoding Between Encode and XML::LibXML

Fellow Perlers,

I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that 
appears to mangle an originating Flickr feed. But the curious thing is, when I 
pull the offending string out of the RSS and just stick it in a script, Encode 
knows how to decode it properly, while XML::LibXML (and my Unicode-aware 
editors) cannot.

The attached script demonstrates. $str has the bogus-looking character "Ä". 
Encode, however, seems to properly convert it to the "č" in "Laurinavičius" in 
the output. XML::LibXML, OTOH, outputs it as "LaurinaviÄ ius" -- that is, 
broken. (If things look truly borked in this email too, please look at the 
attached script.)

So my question is, what gives? Is this truly a broken representation of the 
character and Encode just figures that out and fixes it? Or is there something 
off with my editor and with XML::LibXML?

FWIW, the character looks correct in my editor when I load it from the original 
Flickr feed. It's only after processing by Yahoo! Pipes that it comes out 
looking mangled.

Any insights would be appreciated.

Best,

David




Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Michael Ludwig
David E. Wheeler schrieb am 15.06.2010 um 22:55 (-0700):
> 
> But the curious thing is, when I pull the offending string out of
> the RSS and just stick it in a script, Encode knows how to decode it
> properly, while XML::LibXML (and my Unicode-aware editors) cannot.

Try passing the parser options as a hash reference:

  my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'});

In order to print Unicode text strings (as opposed to octet strings)
correctly to a terminal (UTF-8 or not), add the following line before
the first output:

  binmode STDOUT, ':utf8';

But note that STDOUT is global.
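
(For reference, a minimal runnable sketch combining both suggestions; the
sample markup below is made up, and $str is assumed to hold raw UTF-8 octets
rather than an already-decoded text string:)

  #!/usr/bin/perl
  use strict;
  use warnings;
  use XML::LibXML;

  binmode STDOUT, ':utf8';    # STDOUT is global, so set it once, up front

  # hypothetical input: raw UTF-8 octets for "Tomas Laurinavičius"
  my $str    = "<p>Tomas Laurinavi\xC4\x8Dius</p>";
  my $parser = XML::LibXML->new;
  my $doc    = $parser->parse_html_string($str, { encoding => 'utf-8' });
  print $doc->findvalue('//p'), "\n";    # Tomas Laurinavičius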

Hope this helps!
-- 
Michael Ludwig


Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Daisuke Maki
I remember XML::LibXML doing funky things with the utf8 flag -- but in your
case, is it possible to try using a proper XML declaration?

i.e.:

  <?xml version="1.0" encoding="utf-8"?>
  ... Tomas ...

This seems to produce the correct output for me (perl 5.12.1, LibXML 1.70)

--d

2010/6/16 David E. Wheeler :
> Fellow Perlers,
>
> I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that 
> appears to mangle an originating Flickr feed. But the curious thing is, when 
> I pull the offending string out of the RSS and just stick it in a script, 
> Encode knows how to decode it properly, while XML::LibXML (and my 
> Unicode-aware editors) cannot.
>
> The attached script demonstrates. $str has the bogus-looking character "Ä". 
> Encode, however, seems to properly convert it to the "č" in "Laurinavičius" 
> in the output. XML::LibXML, OTOH, outputs it as "LaurinaviÄ ius" -- that is, 
> broken. (If things look truly borked in this email too, please look at the 
> attached script.)
>
> So my question is, what gives? Is this truly a broken representation of the 
> character and Encode just figures that out and fixes it? Or is there 
> something off with my editor and with XML::LibXML?
>
> FWIW, the character looks correct in my editor when I load it from the 
> original Flickr feed. It's only after processing by Yahoo! Pipes that it 
> comes out looking mangled.
>
> Any insights would be appreciated.
>
> Best,
>
> David
>
>
>


Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 12:04 AM, Henning Michael Møller Just wrote:

> Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW)

Thanks. Come see my tutorial at OSCON this year, if you can: Test-Driven 
Database Development. :-) Not sure I can make a tutorial as entertaining, alas. 
Perhaps if I bring beer for the audience.

> Which editor do you use? When loading the script in Komodo IDE 5.2 the string 
> looks broken. Running the script (ActivePerl 5.10.1 on Windows) only the 
> second line is correct - the first (no surprise) and third are broken.

Yes, that's how it looks to me in GNU Emacs (compiled from source with cocoa 
bindings).

> Loading the file in UltraEdit-32 13.20+3, set to not convert the script on 
> loading, it becomes obvious that what should have been one character is 
> represented by 4 bytes, \xC3 \x84 \xC2 \x8D, which modern editors would 
> probably show as 2 characters and as broken.

Right.

> It looks to me like the string is being displayed as a byte representation of 
> the characters, if that makes sense. My English isn't perfect :-/ and what I 
> am trying to say is that this is a problem that I am quite familiar with. It 
> happens whenever the source and the reader do not agree on whether a string 
> is encoded in utf-8 or not.
> 
> Apparently Encode fixes the incorrect string which is nice. The interesting 
> thing is, where should this be fixed? If it's at Yahoo! Pipes you'll probably 
> have to use Encode as a work-around for some time...

Yes.

Best,

David



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:

> David E. Wheeler schrieb am 15.06.2010 um 22:55 (-0700):
>> 
>> But the curious thing is, when I pull the offending string out of
>> the RSS and just stick it in a script, Encode knows how to decode it
>> properly, while XML::LibXML (and my Unicode-aware editors) cannot.
> 
> Try passing the parser options as a hash reference:
> 
>  my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'});

WTF! That fixes it! I don't understand why it seems to be ignoring the encoding 
set in the constructor. But I've noticed the same thing with other options. 
Seems like there's some consistency to be worked out in XML::LibXML options, 
still.

> In order to print Unicode text strings (as opposed to octet strings)
> correctly to a terminal (UTF-8 or not), add the following line before
> the first output:
> 
>  binmode STDOUT, ':utf8';
> 
> But note that STDOUT is global.

Yes, I do this all the time. Surprisingly, I don't get warnings for this 
script, even though it is outputting multibyte characters.

Thanks,

David



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 15, 2010, at 11:24 PM, Daisuke Maki wrote:

> I remember XML::LibXML doing funky things with the utf8 flag -- but in your
> case, is it possible to try using a proper XML declaration?
> 
> i.e.:
> 
>   <?xml version="1.0" encoding="utf-8"?>
>   ... Tomas ...

No, I'm pulling the example I posted out of the CDATA of an RSS description 
field, and then passing that along to XML::LibXML. So there's no encoding 
specified.

> This seems to produce the correct output for me (perl 5.12.1, LibXML 1.70)

Good to know, thanks. I'm wondering now how I'm going to get at the encoding 
from Feed::Data in order to pass that along to XML::LibXML when parsing encoded 
feed content.

Thanks again for the help, folks.

Best,

David



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:

> On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:
> 
>> Try passing the parser options as a hash reference:
>> 
>> my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'});
> 
> WTF! That fixes it! I don't understand why it seems to be ignoring the 
> encoding set in the constructor. But I've noticed the same thing with other 
> options. Seems like there's some consistency to be worked out in XML::LibXML 
> options, still.

Okay, a bit more information: this was not quite it, alas.

>> In order to print Unicode text strings (as opposed to octet strings)
>> correctly to a terminal (UTF-8 or not), add the following line before
>> the first output:
>> 
>> binmode STDOUT, ':utf8';
>> 
>> But note that STDOUT is global.
> 
> Yes, I do this all the time. Surprisingly, I don't get warnings for this 
> script, even though it is outputting multibyte characters.

This is key. If I set the binmode on STDOUT to :utf8, the bogus characters 
print out bogus. If I set it to :raw, they come out right after processing by 
both Encode and XML::LibXML (I'm assuming they're interpreted as latin-1).

So my question is this: Why isn't Encode dying when it runs into these 
characters? They're not valid utf-8, AFAICT. Are they somehow valid utf8 (that 
is, valid in Perl's internal format)? Why would they be?

I think what I need is some code to strip non-utf8 characters from a string -- 
even if that string has the utf8 bit switched on. I thought that Encode would 
do that for me, but in this case apparently not. Anyone got an example?

Thanks,

David




Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread John Delacour

At 22:55 -0700 15/6/10, David E. Wheeler wrote:

...So my question is, what gives? Is this truly a broken 
representation of the character and Encode just figures that out and 
fixes it? Or is there something off with my editor and with 
XML::LibXML?


...Attachment converted: macmini:try.pl (TEXT/CSOm) (000F502E)



When I open your attachment 'try.pl' in BBEdit it has Mac encoding 
and Mac linefeeds and five invisible characters that I haven't 
analysed wherever you have double line-spacing.  And if I tell BBEdit 
to re-open the file as utf-8 I get the warning "The UTF-8 file 
'try.pl' is damaged or badly formed", so it looks to me as if your 
editor may be at fault.


I have BBEdit set to create new documents with UTF-8 encoding and 
UNIX line feeds and to use UTF-8 for I/O.  I gather you don't use 
BBEdit.


JD


Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 3:07 PM, John Delacour wrote:

> When I open your attachment 'try.pl' in BBEdit it has Mac encoding and Mac 
> linefeeds and five invisible characters that I haven't analysed wherever you 
> have double line-spacing.  And if I tell BBEdit to re-open the file as utf-8 
> I get the warning "The UTF-8 file 'try.pl' is damaged or badly formed, so it 
> looks to me as if your editor may be at fault.
> 
> I have BBEdit set to create new documents with UTF-8 encoding and UNIX line 
> feeds and to use UTF-8 for I/O.  I gather you don't use BBEdit.

No, but it looks wrong in both GNU Emacs and in TextMate. I really don't 
understand why Encode doesn't strip it out or throw an exception (depending on 
whether CHECK is set to 0 or 1). That's the big question in my mind.
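
(A small sketch of the CHECK behaviour in question, with a made-up byte
string: by default decode() substitutes U+FFFD for malformed input, while
FB_CROAK makes it die:)

  use strict;
  use warnings;
  use Encode qw(decode FB_CROAK);

  my $bad = "Laurinavi\xFFius";             # \xFF is never valid UTF-8
  my $lax = decode('UTF-8', $bad);          # CHECK = 0: the bad byte becomes U+FFFD
  printf "%d characters\n", length $lax;    # 13

  eval { decode('UTF-8', $bad, FB_CROAK) };
  print $@ ? "croaked: $@" : "decoded fine\n";    # croaks on the malformed byte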

Best,

David



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
> I think what I need is some code to strip non-utf8 characters from a string
> -- even if that string has the utf8 bit switched on. I thought that Encode
> would do that for me, but in this case apparently not. Anyone got an
> example?

Try this:

Encode::_utf8_off($string);
$string = Encode::decode('utf8', $string);

That will replace any byte sequences which are invalid UTF-8 with the Unicode
replacement character.  

If you want to guarantee that the flag is on first, do this:

utf8::upgrade($string);
Encode::_utf8_off($string);
$string = Encode::decode('utf8', $string);

Devel::Peek's Dump() function will come in handy for checking results.
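
(A runnable version of that recipe, for reference; the sample string is made
up and deliberately ends in a byte that is not valid UTF-8:)

  use strict;
  use warnings;
  use Encode ();
  use Devel::Peek;

  my $string = "Laurinavi\xC4\x8Dius\xFF";    # valid UTF-8 for "č", plus a stray \xFF
  Encode::_utf8_off($string);                 # make sure we start from raw octets
  $string = Encode::decode('utf8', $string);  # invalid sequences become U+FFFD
  Dump($string);                              # UTF8 flag on; the \xFF is now "\x{fffd}"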

Cheers,

Marvin Humphrey



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote:

> On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote:
>> I think what I need is some code to strip non-utf8 characters from a string
>> -- even if that string has the utf8 bit switched on. I thought that Encode
>> would do that for me, but in this case apparently not. Anyone got an
>> example?
> 
> Try this:
> 
>Encode::_utf8_off($string);
>$string = Encode::decode('utf8', $string);
> 
> That will replace any byte sequences which are invalid UTF-8 with the Unicode
> replacement character.  

Yeah. Not working for me. See attached script. Devel::Peek says:

SV = PV(0x100801f18) at 0x10082f368
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x1002015c0 "Tomas Laurinavi\303\204\302\215ius"\0 [UTF8 
"Tomas Laurinavi\x{c4}\x{8d}ius"]
  CUR = 29
  LEN = 32

So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is 
that crap?

Confused and frustrated,

David
#!/usr/local/bin/perl -w

use 5.12.0;
use Encode;
use Devel::Peek;

my $str = 'Tomas LaurinaviÄ ius';
my $utf8 = decode('UTF-8', $str);
say $str;
binmode STDOUT, ':utf8';
say $utf8;

Dump($utf8);


Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:

> So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is 
> that crap?

That's octal notation, which I think Dump() uses for any byte greater than 127
and for control characters, so that it can output pure ASCII.  

That sequence is only four bytes: 
  
  mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; 
Encode::_utf8_on($s); Dump $s'
  SV = PV(0x801038) at 0x80e880
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
CUR = 4   <--- four bytes
LEN = 8
  mar...@smokey:~ $ 

The logical content of the string follows in the second quote:

>  [UTF8 "Tomas Laurinavi\x{c4}\x{8d}ius"]

That's valid UTF-8.

> my $str = 'Tomas LaurinaviÄ ius';

In source code, I try to stick to pure ASCII and use \x escapes -- like Dump()
does.

  my $str = "Tomas Laurinavi\x{c4}\x{8d}ius"

However, because those code points are both representable as Latin-1, Perl
will create a Latin-1 string.  If you want to force its internal encoding to
UTF-8, you need to do additional work.

  mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; 
utf8::upgrade($s); Dump $s'
  SV = PV(0x801038) at 0x80e870
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2012e0 "\304"\0
CUR = 1
LEN = 4
  SV = PV(0x801038) at 0x80e870
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
CUR = 2
LEN = 3
  mar...@smokey:~ $ 

> Confused and frustrated,

IMO, to get UTF-8 right consistently in a large Perl system, you need to
understand the internals and you need Devel::Peek at hand.  Perl tries to hide
the details, but there are too many ways for it to fail silently.  ("perl -C",
$YAML::Syck::ImplicitUnicode, etc.)

Marvin Humphrey



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread David E. Wheeler
On Jun 16, 2010, at 6:03 PM, Marvin Humphrey wrote:

> On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:
> 
>> So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What 
>> is that crap?
> 
> That's octal notation, which I think Dump() uses for any byte greater than 127
> and for control characters, so that it can output pure ASCII.  

Okay.

> That sequence is only four bytes: 
> 
>  mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; 
> Encode::_utf8_on($s); Dump $s'
>  SV = PV(0x801038) at 0x80e880
>REFCNT = 1
>FLAGS = (POK,pPOK,UTF8)
>PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
>CUR = 4   <--- four bytes
>LEN = 8
>  mar...@smokey:~ $ 
> 
> The logical content of the string follows in the second quote:
> 
>> [UTF8 "Tomas Laurinavi\x{c4}\x{8d}ius"]
> 
> That's valid UTF-8.

In what sense? Legally perhaps, but I can make XML::LibXML choke on it.

>> my $str = 'Tomas LaurinaviÄ ius';
> 
> In source code, I try to stick to pure ASCII and use \x escapes -- like Dump()
> does.
> 
>  my $str = "Tomas Laurinavi\x{c4}\x{8d}ius"

Okay, that makes it easier to test things (I've been pulling stuff out of the 
broken feed I downloaded).

> However, because those code points are both representable as Latin-1, Perl
> will create a Latin-1 string.  If you want to force its internal encoding to
> UTF-8, you need to do additional work.
> 
>  mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; 
> utf8::upgrade($s); Dump $s'
>  SV = PV(0x801038) at 0x80e870
>REFCNT = 1
>FLAGS = (POK,pPOK)
>PV = 0x2012e0 "\304"\0
>CUR = 1
>LEN = 4
>  SV = PV(0x801038) at 0x80e870
>REFCNT = 1
>FLAGS = (POK,pPOK,UTF8)
>PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
>CUR = 2
>LEN = 3
>  mar...@smokey:~ $ 
> 
>> Confused and frustrated,
> 
> IMO, to get UTF-8 right consistently in a large Perl system, you need to
> understand the internals and you need Devel::Peek at hand.  Perl tries to hide
> the details, but there are too many ways for it to fail silently.  ("perl -C",
> $YAML::Syck::ImplicitUnicode, etc.)

Bleh. Such a PITA. I'd like not to have to think about this stuff, but I must 
because other people haven't.

So here's my test:

use 5.12.0;
use Devel::Peek;

my $str = "Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius";
say $str;
utf8::upgrade($str);
binmode STDOUT, ':utf8';
say $str;
Dump $str;

The output is still broken, however, in both cases, looking like this:

LaurinaviÄ ius
LaurinaviÄ ius
SV = PV(0x100801c78) at 0x10082ac40
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK,UTF8)
  PV = 0x100202170 "Laurinavi\303\203\302\204\303\202\302\215ius"\0 [UTF8 
"Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius"]
  CUR = 20
  LEN = 32

So it may be valid UTF-8, but why does it come out looking like crap? That is, 
"Laurinavičius"? I suppose there's an argument that "Laurinavičius" is 
correct and valid, if ugly. Maybe?

Thanks,

David





Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread David E. Wheeler
On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:

>> So it may be valid UTF-8, but why does it come out looking like crap? That 
>> is, "LaurinaviÃ≥Ÿius"? I suppose there's an > argument that 
>> "LaurinaviÄŸius" is correct and valid, if ugly. Maybe?
> 
> I am unsure if this is the explanation you are looking for but here goes:
> 
> I think the original data contained the character \x{010d}. In utf-8, that 
> means that it should be represented as the bytes \x{c4} and \x{8d}. If those 
> bytes are not marked as in fact being a two-byte utf-8 encoding of a single 
> character, or if an application reading the data mistakenly thinks it is not 
> encoded (both common errors), somewhere along the transmission an application 
> may decide that it needs to re-encode the characters in utf-8. 
> 
> So the original character \x{010d} is represented by the bytes \x{c4} and 
> \x{8d}, an application thinks those are in fact characters and encodes them 
> again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. Which I believe 
> is your broken data.

I see. That makes sense. FYI, the original source is at:

  
http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22

Look for "Tomas" in the output. If it doesn't show pu, change max=50 to max=75 
or something.

> I think the error comes from Perl's handling of utf-8 data and that this 
> handling has changed in subtle ways all the way since Perl 5.6. We have 
> supported utf-8 in our applications since Perl 5.6 and have experienced this 
> repeatedly. Any major upgrade of Perl, or indeed the much-needed upgrade of 
> DBD::ODBC that Martin Evans provided, has given us a lot of work trying to sort 
> out these troubles.

Maintaining the backwards compatibility from the pre-utf8 days must make it far 
more difficult than it otherwise would be.

> I wonder if your code would work fine in Perl 5.8? We are "only" at 5.10(.1) 
> but the upgrade from 5.8 to 5.10 also gave us some utf-8 trouble. If it works 
> fine in Perl 5.8 maybe the error is in an assumption somewhere in XML::LibXML?

In my application, I finally got XML::LibXML to choke on the invalid 
characters, and then found that the problem was that I was running 
Encode::ZapCP1252's zap_cp1252 against the string before passing it to XML::LibXML. 
Once I removed that, it stopped choking. So clearly zap_cp1252 was changing 
bytes it should not have. I now have it running fix_cp1252 *after* the parsing, 
when everything is already UTF-8. Now that I think about it, though, I should 
probably change it so that it searches on characters instead of bytes when 
working on a utf8 string. Will have to look into that.

In the meantime, I'll just accept that sometimes the characters are valid UTF-8 
and look like shit. Frankly, when I run the above feed through NetNewsWire, the 
offending byte sequence displays as "Ä", just as it does in my app's output. So 
I blame Yahoo.

Thanks for the detailed explanation, Henning, much appreciated.

Best,

David

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread Marvin Humphrey
On Thu, Jun 17, 2010 at 10:17:52AM -0700, David E. Wheeler wrote:

> > The logical content of the string follows in the second quote:
> > 
> >> [UTF8 "Tomas Laurinavi\x{c4}\x{8d}ius"]
> > 
> > That's valid UTF-8.
> 
> In what sense? Legally perhaps, but I can make XML::LibXML choke on it.

There are two valid states for Perl scalars containing string data.

  * SVf_UTF8 flag off.
  * SVf_UTF8 flag on, and string data which is a valid UTF-8 byte sequence.

In both cases, we define the logical content of the string as a series of
Unicode code points.  

If the UTF8 flag is off, then the scalar's data will be interpreted as
Latin-1.  (Except under "use locale" but let's ignore that for now.)  Each
byte will be interpreted as a single code point.  The 256 logical code points
in Latin-1 are identical to the first 256 logical code points in Unicode.
This is by design -- the Unicode consortium chose to overlap with Latin-1
because it was so common.  So any string content that consists solely of code
points 255 and under can be represented in Latin-1 without loss.

In a Perl scalar with the UTF8 flag on, you can get the code points by
decoding the variable width UTF-8 data, with each code point derived by
reading 1-5 bytes.  *Any* sequence of Unicode code points can be represented
without loss.

Unfortunately, it is really, really easy to mess up string handling when
writing XS modules.  A common error is to strip the UTF8 flag accidentally.
This changes the scalar's logical content, as now its string data will be
interpreted as Latin-1 rather than UTF-8.  

A less common error is to turn on the UTF8 flag for a scalar which does not
contain a valid UTF-8 byte sequence.  This puts the scalar into what I'm
calling an "invalid state".  It will likely bring your program down with a
"panic" error message if you try to do something like run a regex on it.

In your case, the Dump of the scalar demonstrated that it had the UTF8 flag
set and that it contained a valid UTF-8 byte sequence -- a "valid state".
However, it looks like it had invalid content.

A scalar with the UTF8 flag off can never be in an "invalid state", because
any sequence of bytes is valid Latin-1.  However, it's easy to change the
string's logical content by accidentally stripping or forgetting to set the
UTF8 flag.  Unfortunately, this error leads to silent failure -- no error
message, but the content changes -- and it can be really hard to debug.
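
(A sketch of that silent failure with a made-up name: stripping the flag
leaves the same bytes in place but changes what they mean:)

  use strict;
  use warnings;
  use Encode ();

  my $name = "Laurinavi\x{010d}ius";   # text string; the UTF8 flag must be on
  Encode::_utf8_off($name);            # same bytes, now interpreted as Latin-1
  binmode STDOUT, ':utf8';
  print "$name\n";                     # prints mojibake ("LaurinaviÄ..."), with no warning, no error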

This fellow's name, which you can see if you visit
, contains Unicode code point 0x010d, "LATIN SMALL
LETTER C WITH CARON".  As that code point is greater than 255, any Perl string
containing his name *must* have the UTF8 flag turned on.  

I strongly suspect that at some point one of the following two things
happened:

* The code was input from a UTF-8 source but the input filehandle was not
  set to UTF-8 -- open (my $fh, '<:encoding(utf8)', $file) or die;
* The flag got stripped and subsequently the UTF-8 data was incorrectly
  reinterpreted as Latin-1.

You typically need Devel::Peek for hunting down the second kind of error.

> > IMO, to get UTF-8 right consistently in a large Perl system, you need to
> > understand the internals and you need Devel::Peek at hand.  Perl tries to 
> > hide
> > the details, but there are too many ways for it to fail silently.  ("perl 
> > -C",
> > $YAML::Syck::ImplicitUnicode, etc.)
> 
> Bleh. Such a PITA. I'd like not to have to think about this stuff, but I
> must because other people haven't.

It's more that getting UTF-8 support into Perl without breaking existing
programs was a truly awesome hack -- but that one of the limitations of that
hack was that the implementation is prone to silent failure.

> So here's my test:
> 
> use 5.12.0;
> use Devel::Peek;
> 
> my $str = "Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius";
> say $str;
> utf8::upgrade($str);
> binmode STDOUT, ':utf8';
> say $str;
> Dump $str;
> 
> The output is still broken, however, in both cases, looking like this:
> 
> LaurinaviÄ ius
> LaurinaviÄ ius

Let's double check something first.  Based on your mail client (Apple Mail) I
see you're (still) using OS X.  Check out Terminal -> Preferences -> Advanced
-> Character encoding. What's it set to?  If it's not "Unicode (UTF-8)", set
it to that now.

Then try this:

use 5.10.0;
use Devel::Peek;

my $str = "Tomas Laurinavi\x{010d}ius";
say $str;

binmode STDOUT, ':utf8';
say $str;

Dump $str;
utf8::upgrade($str); # no effect
Dump $str;

For me, that prints his name correctly twice.  The first time, though, I get
a "wide character in print" warning.  That warning arises because Perl's
STDOUT is set to Latin-1 by default.  It wants to "downgrade" the UTF8 scalar
to Latin-1, but it can't do so without loss, so it warns and outputs the bytes
as is.  After we change STDOUT to 'utf8', the warning goes away.

The utf8::upgrade() call has no effect, because the string already has the 
UTF8 flag on -- it contains a code point above 255.

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread John Delacour

At 13:24 -0700 17/6/10, David E. Wheeler wrote:

On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:


 So the original character \x{010d} is represented by the bytes 
\x{c4} and \x{8d}, an application thinks those are in fact 
characters and encodes them again as \x{c3} + \x{84} and \x{c2} + 
\x{8d}, respectively. Which I believe is your broken data.


I see. That makes sense. FYI, the original source is at:


http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22




In the meantime, I'll just accept that sometimes the characters are 
valid UTF-8 and look like shit. Frankly, when I run the above feed 
through NetNewsWire, the offending byte sequence displays as "Ä", 
just as it does in my app's output. So I blame Yahoo.



Quite right.  Now that I see the file it is clear that the encoding has 
been done twice, each of the two bytes for the c-with-caron being 
again encoded to produce four bytes.


If I save the file and undo the second encoding I get the proper output


#!/usr/bin/perl
use strict;
use Encode;
no warnings;
my $f = "$ENV{HOME}/desktop/pipe.run";
open F, $f;
while (<F>) {
  print decode("utf-8", $_);
}



JD



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread John Delacour

At 00:27 +0100 18/6/10, I wrote:


If I save the file and undo the second encoding I get the proper output



In this case all talk of iso-8859-1 and cp1252 is a red herring.  I 
read several Italian websites where this same problem is manifest in 
external material such as ads.  The news page proper is encoded 
properly and declared as utf-8 but I imagine the web designers have 
reckoned that the stuff they receive from the advertisers is most 
likely to be received as windows-1252 and convert accordingly rather 
than bother to verify the encoding.  As a result material that is 
received as utf-8 will undergo a superfluous encoding.


Here's a way to get the file in question properly encoded:


#!/usr/bin/perl
use strict;
use LWP::Simple;
use Encode;
no warnings; # avoid wide character warning
my $tempdir = "/tmp";
my $tempfile = "tempfile";
my $f = "$tempdir/$tempfile";
my $uri="http://pipes.yahoo.com/pipes/pipe.run";.
"?Size=Medium&_id=f53b7bed8b88412fab9715a995629722".
"&_render=rss&max=50&nsid=1025993%40N22";
if (getstore($uri, $f)){
  open F, $f or die $!;
  while (<F>) {
    my $encoding = find_encoding("utf-8");
    my $utf8 = $encoding->decode($_);
    print $utf8;
  }
  close F;
}
unlink $f;

JD



Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread John Delacour

At 08:05 +0100 18/6/10, John Delacour wrote:


  while (<F>) {
    my $encoding = find_encoding("utf-8");


That should be

my $encoding = find_encoding("utf-8");
while (<F>) {

of course!


RE: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread Henning Michael Møller Just
> So it may be valid UTF-8, but why does it come out looking like crap? That 
> is, "Laurinavičius"? I suppose there's an > argument that "Laurinavičius" 
> is correct and valid, if ugly. Maybe?

I am unsure if this is the explanation you are looking for but here goes:

I think the original data contained the character \x{010d}. In utf-8, that 
means that it should be represented as the bytes \x{c4} and \x{8d}. If those 
bytes are not marked as in fact being a two-byte utf-8 encoding of a single 
character, or if an application reading the data mistakenly thinks it is not 
encoded (both common errors), somewhere along the transmission an application 
may decide that it needs to re-encode the characters in utf-8. 

So the original character \x{010d} is represented by the bytes \x{c4} and 
\x{8d}, an application thinks those are in fact characters and encodes them 
again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. Which I believe is 
your broken data.
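
(The re-encoding described above can be reproduced directly with Encode; a
small sketch:)

  use strict;
  use warnings;
  use Encode qw(encode decode);

  my $char  = "\x{010d}";                  # LATIN SMALL LETTER C WITH CARON
  my $once  = encode('UTF-8', $char);      # the bytes \xC4 \x8D
  # a later stage mistakes those octets for Latin-1 characters and encodes again:
  my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

  print "once:  ", unpack('H*', $once),  "\n";   # c48d
  print "twice: ", unpack('H*', $twice), "\n";   # c384c28d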

I think the error comes from Perl's handling of utf-8 data and that this 
handling has changed in subtle ways all the way since Perl 5.6. We have 
supported utf-8 in our applications since Perl 5.6 and have experienced this 
repeatedly. Any major upgrade of Perl, or indeed the much-needed upgrade of 
DBD::ODBC that Martin Evans provided, has given us a lot of work trying to sort out 
these troubles.

I wonder if your code would work fine in Perl 5.8? We are "only" at 5.10(.1) 
but the upgrade from 5.8 to 5.10 also gave us some utf-8 trouble. If it works 
fine in Perl 5.8 maybe the error is in an assumption somewhere in XML::LibXML?


Best regards
Henning Michael Møller Just


Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread David E. Wheeler
Marvin,

I can always count on you for a detailed explanation. Thanks. You ought to turn 
this into a blog post!

On Jun 17, 2010, at 4:06 PM, Marvin Humphrey wrote:

> There are two valid states for Perl scalars containing string data.
> 
>  * SVf_UTF8 flag off.
>  * SVf_UTF8 flag on, and string data which is a valid UTF-8 byte sequence.
> 
> In both cases, we define the logical content of the string as a series of
> Unicode code points.  
> 
> If the UTF8 flag is off, then the scalar's data will be interpreted as
> Latin-1.  (Except under "use locale" but let's ignore that for now.)  Each
> byte will be interpreted as a single code point.  The 256 logical code points
> in Latin-1 are identical to the first 256 logical code points in Unicode.
> This is by design -- the Unicode consortium chose to overlap with Latin-1
> because it was so common.  So any string content that consists solely of code
> points 255 and under can be represented in Latin-1 without loss.

Hrm. So am I safe in changing the CP1252 gremlin bytes to proper UTF-8 
characters in Encode::ZapCP1252 like so?

  $_[0] =~ s{([\x80-\x9f])}{
    $table->{$1} ? Encode::decode('UTF-8', $table->{$1}) : $1
  }emxsg;

Where `$table` is the lookup table mapping hex values like \x80 to their UTF-8 
equivalents (€)? This is assuming that $_[0] has the UTF8 flag on, of course.

So is this safe? Are \x80-\x9f considered characters when the utf8 flag is on, 
or are they bytes that might break multibyte characters that use those bytes?
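
(One way to check, sketched with a made-up string: on a UTF8-flagged text
string the character class matches code points U+0080..U+009F, not the raw
bytes inside multi-byte sequences:)

  use strict;
  use warnings;
  use Encode qw(decode);

  my $txt = decode('UTF-8', "caron: \xC4\x8D gremlin: \xC2\x80");
  # $txt now holds U+010D and U+0080 as single characters
  my $count = () = $txt =~ /[\x80-\x9f]/g;
  print "$count match(es)\n";   # 1 -- only U+0080; the 0x8D byte inside U+010D is not seen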

> In a Perl scalar with the UTF8 flag on, you can get the code points by
> decoding the variable width UTF-8 data, with each code point derived by
> reading 1-5 bytes.  *Any* sequence of Unicode code points can be represented
> without loss.

Right.

> Unfortunately, it is really, really easy to mess up string handling when
> writing XS modules.  A common error is to strip the UTF8 flag accidentally.
> This changes the scalar's logical content, as now its string data will be
> interpreted as Latin-1 rather than UTF-8.  
> 
> A less common error is to turn on the UTF8 flag for a scalar which does not
> contain a valid UTF-8 byte sequence.  This puts the scalar into what I'm
> calling an "invalid state".  It will likely bring your program down with a
> "panic" error message if you try to do something like run a regex on it.

Fortunately, I'm not writing XS modules. :-)

> In your case, the Dump of the scalar demonstrated that it had the UTF8 flag
> set and that it contained a valid UTF-8 byte sequence -- a "valid state".
> However, it looks like it had invalid content.

Yes. I broke it with zap_cp1252 (applied before decoding). I just removed that 
and things became valid again. The character was still broken, as it is in the 
feed, but at least it was valid -- and the same as the source.

> A scalar with the UTF8 flag off can never be in an "invalid state", because
> any sequence of bytes is valid Latin-1.  However, it's easy to change the
> string's logical content by accidentally stripping or forgetting to set the
> UTF8 flag.  Unfortunately, this error leads to silent failure -- no error
> message, but the content changes -- and it can be really hard to debug.

Yes, this is what happened to me by zapping the non-utf8 scalar with zap_cp1252 
before decoding it. Bad idea.

> This fellow's name, which you can see if you visit
> , contains Unicode code point 0x010d, "LATIN 
> SMALL
> LETTER C WITH CARON".  As that code point is greater than 255, any Perl string
> containing his name *must* have the UTF8 flag turned on.  
> 
> I strongly suspect that at some point one of the following two things
> happened:
> 
>* The code was input from a UTF-8 source but the input filehandle was not
>  set to UTF-8 -- open (my $fh, '<:encoding(utf8)', $file) or die;

Well, I was pulling it from HTTP::Response->content. I'm not using 
HTTP::Response->decoded_content because it's XML, which should be binary (see 
http://juerd.nl/site.plp/perluniadvice).
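
(A sketch of that approach with a hypothetical feed URL: hand the raw octets
from content() to XML::LibXML and let the parser honour the document's own
encoding declaration:)

  use strict;
  use warnings;
  use LWP::UserAgent;
  use XML::LibXML;

  my $res = LWP::UserAgent->new->get('http://example.com/feed.xml');  # hypothetical URL
  die $res->status_line unless $res->is_success;

  # content() returns the raw octets; the parser applies the declared encoding itself
  my $doc = XML::LibXML->new->parse_string($res->content);
  print $doc->documentElement->nodeName, "\n";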

>* The flag got stripped and subsequently the UTF-8 data was incorrectly
>  reinterpreted as Latin-1.

> You typically need Devel::Peek for hunting down the second kind of error.

I missed that one, fortunately.

> It's more that getting UTF-8 support into Perl without breaking existing
> programs was a truly awesome hack -- but that one of the limitations of that
> hack was that the implementation is prone to silent failure.

Right. It's an impressive achievement. And I can't wait until DBI 2 is built on 
Rakudo. ;-)

>> The output is still broken, however, in both cases, looking like this:
>> 
>>LaurinaviÄ ius
>>LaurinaviÄ ius
> 
> Let's double check something first.  Based on your mail client (Apple Mail) I
> see you're (still) using OS X.  Check out Terminal -> Preferences -> Advanced
> -> Character encoding. What's it set to?  If it's not "Unicode (UTF-8)", set
> it to that now.

I always use UTF-8. Snow Leopard actua

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread David E. Wheeler
On Jun 18, 2010, at 12:05 AM, John Delacour wrote:

> In this case all talk of iso-8859-1 and cp1252 is a red herring.  I read 
> several Italian websites where this same problem is manifest in external 
> material such as ads.  The news page proper is encoded properly and declared 
> as utf-8 but I imagine the web designers have reckoned that the stuff they 
> receive from the advertisers is most likely to be received as windows-1252 
> and convert accordingly rather than bother to verify the encoding.  As a 
> result material that is received as utf-8 will undergo a superfluous encoding.
> 
> Here's a way to get the file in question properly encoded:

Yep, that works for me, too. I guess XML::LibXML isn't using Encode in the same 
way to decode content, as it returns the string with the characters as 
\x{c4}\x{8d}.

Thanks for the help, everyone. I've got my code parsing all my feeds and 
emitting a valid UTF-8 feed of its own now.

Best,

David

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-19 Thread Michael Ludwig
David E. Wheeler schrieb am 16.06.2010 um 13:59 (-0700):
> On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote:
> > On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote:

> >> In order to print Unicode text strings (as opposed to octet
> >> strings) correctly to a terminal (UTF-8 or not), add the following
> >> line before the first output:
> >> 
> >> binmode STDOUT, ':utf8';
> >> 
> >> But note that STDOUT is global.
> > 
> > Yes, I do this all the time. Surprisingly, I don't get warnings for
> > this script, even though it is outputting multibyte characters.
> 
> This is key. If I set the binmode on STDOUT to :utf8, the bogus
> characters print out bogus. If I set it to :raw, they come out right
> after processing by both Encode and XML::LibXML (I'm assuming they're
> interpreted as latin-1).

Yes, or as raw, which is equivalent. Any octet is valid Latin-1.

> So my question is this: Why isn't Encode dying when it runs into these
> characters? They're not valid utf-8, AFAICT. Are they somehow valid
> utf8 (that is, valid in Perl's internal format)? Why would they be?

Assuming we're talking about the same thing here: They're not
characters, they're octets. (The Perl documentation seems to make
an effort to conceptually distinguish between *octets* and *bytes*,
but they map to the same thing.) I found it helpful to accept that
the notion of a "UTF-8 character" does not make sense: there are
Unicode characters, but UTF-8 is an encoding, and it deals with
octets.

Here's your script with some modifications to illustrate how things
work:

  \,,,/
  (o o)
--oOOo-(_)-oOOo--
use strict;
use Encode;
use XML::LibXML;
# The script is written in UTF-8, but the utf8 pragma is not turned on.
# So the literals in our script yield octet strings, not text strings.
# (Note that it is probably much more convenient to go with the utf8
# pragma if you write your source code in UTF-8.)
my $octets = 'Tomas Laurinavičius';
my $txt    = decode_utf8( $octets );
my $txt2   = "Tomas Laurinavi\x{010d}ius";

die if $txt2 ne $txt;# they're equal
die if $txt2 eq $octets; # they're not equal

# print raw UTF-8 octets; looks correct on UTF-8 terminal
print $octets, $/;
# print text containing wide character to narrow character filehandle
print "$txt WARN$/"; # triggers a warning: "Wide character in print"
binmode STDOUT, ':utf8'; # set to utf8, accepting wide characters
print $txt, $/; # print text to terminal
print $octets, $/; # double encoding, č as four bytes

my $parser = XML::LibXML->new;
# specify encoding for octet string
my $doc = $parser->parse_html_string($octets, {encoding => 'utf-8'});
print $doc->documentElement->toString, $/;
# no need to specify encoding for text string
my $doc2 = $parser->parse_html_string($txt);
print $doc2->documentElement->toString, $/;
-- 
Michael Ludwig