Dear Yitzchak,

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of 
Issac Goldstand
Sent: Monday, February 13, 2012 12:30 PM
To: Perl in Israel
Subject: [Israel.pm] Perl unicode question

If there's one thing I can never seem to get straight, it's character 
encodings...

I'm trying to parse some data from the web which can come in different 
encodings, and write unit tests which come from static files.

One of the strings that I'm trying to test for is "Forex Trading Avec 100€".
The string is originally encoded (supposedly) in ISO-8859-1 based on the header 
Content-Type: text/html; charset=ISO-8859-1 and presence of the following META 
tag <meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">

(N.B. I'm a bit confused by that as IIRC, ISO-8859-1 doesn't contain the EUR 
character...)

When opening the source code in a text editor as either ISO-8859-1 or
ISO-8859-15 (or even UTF-8), I can't see the character.  I *do* see the 
character when viewing it as CP1255 which kinda worries me, as I get the 
feeling I'm a lot farther from the source than I think when I see that...
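
A quick way to probe that suspicion is to decode the suspect byte under each 
candidate encoding and look at the code point that comes out. The sketch below 
assumes the euro is stored as the single byte 0x80, which is where both CP1252 
and CP1255 place it. The raw bytes of the file are not shown here, so that 
value is a guess, and last_codepoint is only a helper name made up for the 
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the euro sits in the file as byte 0x80 (not confirmed from the real data).
my $bytes = "Forex Trading Avec 100\x80";

# Decode $raw as $encoding and return the code point of its last character.
sub last_codepoint {
    my ($encoding, $raw) = @_;
    my $text = decode($encoding, $raw);
    return ord(substr($text, -1));
}

# Print the resulting code point for each candidate encoding.
for my $encoding (qw(ISO-8859-1 ISO-8859-15 cp1252 cp1255)) {
    printf "%-12s U+%04X\n", $encoding, last_codepoint($encoding, $bytes);
}
```

If only the Windows code pages yield U+20AC (EURO SIGN), the file is most 
likely CP1252 that has been labelled as ISO-8859-1.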

My unit test for the above is as follows:

use utf8; # String literals contain UTF-8 in this file
binmode STDOUT, ":utf8";
...
open($fh, "<:encoding(ISO-8859-1)", "t/html0004.html") || die "...: $!";
$parser->parse_file($fh); # Subclassed HTML::Parser
...
is($test->{top}, "Forex Trading Avec 100€", "Correct headline text");

However, this test does not pass on the EURO, giving me the following
result:
Wide character in print at /usr/local/share/perl/5.12.4/Test/Builder.pm
line 1759.
#          got: 'Forex Trading Avec 100€'
#     expected: 'Forex Trading Avec 100€'

Both the warning and the mismatch bother me....  The warning, because I assumed 
that opening STDOUT as a utf8 stream would deal with it.  And the mismatch, 
because I can't figure why it's mismatching...

FWIW, when doing this on the web, I'd planned on converting to utf-8 by using 
HTTP::Response's $res->decoded_content to deal with the encoding for me, but 
that seems to be spewing characters that... don't look correct... too :/

Any ideas?


=======================================

There are a number of things that must be done together so that Unicode will be 
supported. And don't put too much weight on the "charset..." clause in the HTML.
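
As a rough illustration of "done together", here is a minimal sketch of such a 
test. It rests on two assumptions that are not confirmed by the original data: 
that the file is really CP1252 despite its ISO-8859-1 label, and that the euro 
is stored as byte 0x80. The in-memory string stands in for the contents of the 
HTML file. Test::Builder writes through its own copies of STDOUT and STDERR, so 
a layer set only on STDOUT does not reach them; that is a likely source of the 
"Wide character" warning.

```perl
use strict;
use warnings;
use utf8;                        # string literals in this file are UTF-8
use Encode qw(decode);
use Test::More tests => 1;

# Set the output layer on Test::Builder's own handles, not just on STDOUT.
my $builder = Test::More->builder;
binmode($builder->output,         ':encoding(UTF-8)');
binmode($builder->failure_output, ':encoding(UTF-8)');
binmode($builder->todo_output,    ':encoding(UTF-8)');

# Hypothetical stand-in for the file contents: the euro as byte 0x80.
my $raw = "Forex Trading Avec 100\x80";

# Assumed real encoding of the data, regardless of what the charset label says.
my $text = decode('cp1252', $raw);

is($text, "Forex Trading Avec 100€", 'Correct headline text');
```

When reading from the file instead, the same idea would mean opening it with 
"<:encoding(cp1252)" rather than "<:encoding(ISO-8859-1)".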

Since this list does not accept attachments, I'll send my upcoming presentation, 
"Unicode aspects in Perl", to your personal address. It will be presented at the 
Israel Perl Workshop 2012 (http://act.perl.org.il/ilpw2012/).

Anybody else who is interested is welcome to ask and I'll send it to her/him 
too.


_______________________________________________
Perl mailing list
[email protected]
http://mail.perl.org.il/mailman/listinfo/perl
