Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-19 Thread Michael Ludwig
David E. Wheeler schrieb am 16.06.2010 um 13:59 (-0700): > On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote: > > On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote: > >> In order to print Unicode text strings (as opposed to octet > >> strings) correctly to a terminal (UTF-8 or not), add the foll

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread David E. Wheeler
On Jun 18, 2010, at 12:05 AM, John Delacour wrote: > In this case all talk of iso-8859-1 and cp1252 is a red herring. I read > several Italian websites where this same problem is manifest in external > material such as ads. The news page proper is encoded properly and declared > as utf-8 but

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread David E. Wheeler
Marvin, I can always count on you for a detailed explanation. Thanks. You ought to turn this into a blog post! On Jun 17, 2010, at 4:06 PM, Marvin Humphrey wrote: > There are two valid states for Perl scalars containing string data. > > * SVf_UTF8 flag off. > * SVf_UTF8 flag on, and string d
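The SVf_UTF8 flag Marvin mentions can be inspected from Perl itself. What follows is a minimal sketch, assuming only the core `utf8::is_utf8` and `utf8::upgrade` functions; the variable names are illustrative and do not come from the thread.

```perl
use strict;
use warnings;

# A one-character string (e-acute) stored as a single byte: flag off.
my $flag_off = "\xe9";

# A copy of the same string, switched to the internal UTF-8 representation.
my $flag_on = $flag_off;
utf8::upgrade($flag_on);

printf "first scalar:  flag %s\n", utf8::is_utf8($flag_off) ? "on" : "off";
printf "second scalar: flag %s\n", utf8::is_utf8($flag_on)  ? "on" : "off";

# Both scalars hold the same logical string; only the storage differs.
printf "equal: %s, lengths: %d and %d\n",
    ($flag_off eq $flag_on ? "yes" : "no"),
    length($flag_off), length($flag_on);
```

The flag describes how the string is stored, not whether its content is what the author intended, which is why a flagged string can still contain doubly encoded text.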

RE: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread Henning Michael Møller Just
> So it may be valid UTF-8, but why does it come out looking like crap? That > is, "Laurinavičius"? I suppose there's an > argument that "Laurinavičius" > is correct and valid, if ugly. Maybe? I am unsure if this is the explanation you are looking for but here goes: I think the original dat

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread John Delacour
At 08:05 +0100 18/6/10, John Delacour wrote: while (){ my $encoding = find_encoding("utf-8"); That should be my $encoding = find_encoding("utf-8"); while (){ of course!

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-18 Thread John Delacour
At 00:27 +0100 18/6/10, I wrote: If I save the file and undo the second decoding I get the proper output In this case all talk of iso-8859-1 and cp1252 is a red herring. I read several Italian websites where this same problem is manifest in external material such as ads. The news page pro

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread John Delacour
At 13:24 -0700 17/6/10, David E. Wheeler wrote: On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote: So the original character \x{010d} is represented by the bytes \x{c4} and \x{8d}, an application thinks those are in fact characters and encodes them again as \x{c3} + \x{84} and
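The double encoding described here can be reproduced with the core Encode module. This is only a sketch: it assumes the faulty application read the two UTF-8 bytes as ISO-8859-1 characters before encoding them again, which the thread does not confirm, and the names `$once`, `$twice`, `$repaired` and `hex_bytes` are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Format a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# The character from the thread: U+010D (c with caron).
my $char = "\x{010d}";

# Correct UTF-8 encoding: two bytes.
my $once = encode('UTF-8', $char);

# Assumed fault: the bytes are taken for Latin-1 characters
# and encoded to UTF-8 a second time.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Repair: reverse both steps in order.
my $repaired = decode('UTF-8', encode('ISO-8859-1', decode('UTF-8', $twice)));

print "encoded once:  ", hex_bytes($once),  "\n";    # C4 8D
print "encoded twice: ", hex_bytes($twice), "\n";    # C3 84 C2 8D
printf "repaired:      U+%04X\n", ord($repaired);    # U+010D
```

The repair only works if the damage really was a Latin-1 round trip; with a different intermediate encoding the middle step would have to change accordingly.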

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread Marvin Humphrey
On Thu, Jun 17, 2010 at 10:17:52AM -0700, David E. Wheeler wrote: > > The logical content of the string follows in the second quote: > > > >> [UTF8 "Tomas Laurinavi\x{c4}\x{8d}ius"] > > > > That's valid UTF-8. > > In what sense? Legally perhaps, but I can make XML::LibXML choke on it. There ar

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread David E. Wheeler
On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote: >> So it may be valid UTF-8, but why does it come out looking like crap? That >> is, "LaurinaviÃ≥Ÿius"? I suppose there's an > argument that >> "LaurinaviÄŸius" is correct and valid, if ugly. Maybe? > > I am unsure if this is the

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-17 Thread David E. Wheeler
On Jun 16, 2010, at 6:03 PM, Marvin Humphrey wrote: > On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote: > >> So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What >> is that crap? > > That's octal notation, which I think Dump() uses for any byte greater than

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote: > So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What is > that crap? That's octal notation, which I think Dump() uses for any byte greater than 127 and for control characters, so that it can output pure ASC
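To relate the octal escapes to the hex values used elsewhere in the thread, the four bytes can be converted directly. A small sketch using only core Perl and Encode; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four octal escapes quoted from the Dump() output.
my $bytes = "\303\204\302\215";

# The same bytes written in hexadecimal.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
print "$hex\n";    # C3 84 C2 8D

# Decoded as UTF-8 they yield two characters, not the single U+010D.
my @code_points = map { sprintf 'U+%04X', ord($_) }
                  split //, decode('UTF-8', $bytes);
print "@code_points\n";    # U+00C4 U+008D
```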

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 4:47 PM, Marvin Humphrey wrote: > On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote: >> I think what I need is some code to strip non-utf8 characters from a string >> -- even if that string has the utf8 bit switched on. I thought that Encode >> would do that for

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Marvin Humphrey
On Wed, Jun 16, 2010 at 01:59:33PM -0700, David E. Wheeler wrote: > I think what I need is some code to strip non-utf8 characters from a string > -- even if that string has the utf8 bit switched on. I thought that Encode > would do that for me, but in this case apparently not. Anyone got an > examp

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 3:07 PM, John Delacour wrote: > When I open your attachment 'try.pl' in BBEdit it has Mac encoding and Mac > linefeeds and five invisible characters that I haven't analysed wherever you > have double line-spacing. And if I tell BBEdit to re-open the file as utf-8 > I get th

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread John Delacour
At 22:55 -0700 15/6/10, David E. Wheeler wrote: ...So my question is, what gives? Is this truly a broken representation of the character and Encode just figures that out and fixes it? Or is there something off with my editor and with XML::LibXML. ...Attachment converted: macmini:try.pl (TEXT

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 9:05 AM, David E. Wheeler wrote: > On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote: > >> Try passing the parser options as a hash reference: >> >> my $doc = $parser->parse_html_string($str, {encoding => 'utf-8'}); > > WTF! That fixes it! I don't understand why it seems to

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 15, 2010, at 11:24 PM, Daisuke Maki wrote: > I remember XML::LibXML doing funky things with the utf8 flag -- but in > your case, > is it possible to try using a proper XML declaration? > > i.e.: > >Tomas No, I'm pulling the example I posted out of the CDATA of an RSS description

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 2:34 AM, Michael Ludwig wrote: > David E. Wheeler schrieb am 15.06.2010 um 22:55 (-0700): >> >> But the curious thing is, when I pull the offending string out of >> the RSS and just stick it in a script, Encode knows how to decode it >> properly, while XML::LibXML (and my Unic

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread David E. Wheeler
On Jun 16, 2010, at 12:04 AM, Henning Michael Møller Just wrote: > Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW) Thanks. Come see my tutorial at OSCON this year, if you can: Test-Driven Database Development. :-) Not sure I can make a tutorial as entertaining, alas. Pe

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Daisuke Maki
I remember XML::LibXML doing funky things with the utf8 flag -- but in your case, is it possible to try using a proper XML declaration? i.e.: Tomas This seems to produce the correct output for me (perl 5.12.1, LibXML 1.70) --d 2010/6/16 David E. Wheeler: > Fellow Perlers, > > I'm par

Re: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Michael Ludwig
David E. Wheeler schrieb am 15.06.2010 um 22:55 (-0700): > > But the curious thing is, when I pull the offending string out of > the RSS and just stick it in a script, Encode knows how to decode it > properly, while XML::LibXML (and my Unicode-aware editors) cannot. Try passing the parser options

RE: Variation In Decoding Between Encode and XML::LibXML

2010-06-16 Thread Henning Michael Møller Just
Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW) Which editor do you use? When loading the script in Komodo IDE 5.2 the string looks broken. Running the script (ActivePerl 5.10.1 on Windows) only the second line is correct - the first (no surprise) and third are broken.