Encode::Mangled?

2009-05-29 Thread Richard Huxton
I'm dealing with data from a web-page that claims to be ISO-8859-1 but 
actually has some Win-1252 embedded in it. I can convert it to UTF-8 and 
all seems well, however the characters need mapping. It's 
straightforward enough to handle the dozen or so chars I know about but 
I can't believe there isn't something on cpan for this.


Concrete example:
Page claims 8859-1 but has the character equivalent to • in it 
(displays as a bullet). Note this isn't the HTML entity, it's a single 
byte = 149. It looks fine in a web-browser because presumably the 
browser special-cases it.
I can happily convert this to UTF-8 and store it (0xC2 0x95), but that's 
not a displayable Unicode character (and certainly not the bullet-point). 
The equivalent should be U+2022 (decimal 8226).
I *think* I'm safe in treating 8859-1 as win1252 since the latter is a 
strict superset. That's not going to work with 8859-15 though.
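
To make the example concrete, here is a minimal sketch of what happens to 
byte 149 under each decoding. It is written in Python purely for 
illustration (the thread itself is about Perl's Encode), and the variable 
names are made up for the sketch.

```python
# The single byte from the page (decimal 149, hex 0x95).
raw = bytes([149])

# Decoding with the charset the page claims gives a C1 control character.
as_latin1 = raw.decode("iso-8859-1")
print(hex(ord(as_latin1)))         # 0x95
print(as_latin1.encode("utf-8"))   # b'\xc2\x95'

# Decoding with the charset the byte actually came from gives the bullet.
as_cp1252 = raw.decode("cp1252")
print(ord(as_cp1252))              # 8226
print(as_cp1252.encode("utf-8"))   # b'\xe2\x80\xa2'
```

So the stored C2 95 is a faithful conversion of the wrong interpretation, 
which is why it round-trips happily but never displays as a bullet.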


Now the *correct* solution is to track down the people responsible for 
this travesty and beat them with sticks. Failing that, are people just 
rolling their own three-line function each time?


--
  Richard Huxton
  Archonet Ltd


Re: Encode::Mangled?

2009-05-29 Thread Robin Berjon

Hi,

On May 29, 2009, at 10:55 , Richard Huxton wrote:

Concrete example:
Page claims 8859-1 but has the character equivalent to • in it  
(displays as a bullet). Note this isn't the HTML entity, it's a  
single byte = 149. It looks fine in a web-browser because presumably  
the browser special-cases it.
I can happily convert this to UTF-8 and store it (0xC2 0x95), but that's  
not a displayable Unicode character (and certainly not the bullet-point).  
The equivalent should be U+2022 (decimal 8226).
I *think* I'm safe in treating 8859-1 as win1252 since the latter is  
a strict superset. That's not going to work with 8859-15 though.


Sorry, I'm not sure I understand precisely what the issue is. Can you  
not simply use Encode to convert it from CP-1252 to UTF-8? In "legacy  
situations" (i.e. broken HTML, most of the web), browsers normally  
default to CP-1252 if they haven't detected another encoding as it  
generally works, even for ISO-8859-1.


If the page claims to be in ISO-8859-15 then the chances are that  
whoever is sending it to you knows what they're doing, and you can just  
use the real thing.


Or am I missing something?

If you're trying to process this in as much as possible the same way  
that browsers do, the algorithm to follow is pretty scary, but should  
get you covered:


  http://dev.w3.org/html5/spec/#determining-the-character-encoding

That actually might be worth a HTML5::DetectEncoding contribution to  
CPAN as it certainly would help improve scrapers and friends.
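
As a rough sketch of the simple version of this advice (not the full HTML5 
sniffing algorithm linked above), one could override only a declared 
ISO-8859-1 with CP-1252 and trust every other label. Python is used for 
illustration only; `decode_like_a_browser` and `LEGACY_OVERRIDES` are 
hypothetical names, and replacing undecodable bytes with U+FFFD is an 
assumption of the sketch, not something the thread specifies.

```python
import codecs

# Hypothetical override table covering only the case discussed here:
# a page that declares ISO-8859-1 is decoded as CP-1252 instead.
LEGACY_OVERRIDES = {
    codecs.lookup("iso-8859-1").name: codecs.lookup("cp1252").name,
}

def decode_like_a_browser(raw, declared):
    """Decode raw bytes, treating a declared ISO-8859-1 as CP-1252."""
    # Normalise aliases such as "latin1" or "ISO_8859-1" to one codec name.
    name = codecs.lookup(declared).name
    codec = LEGACY_OVERRIDES.get(name, name)
    # Assumption: bytes the codec leaves undefined become U+FFFD.
    return raw.decode(codec, errors="replace")

# Byte 149 on a page claiming ISO-8859-1 comes out as the bullet.
print(decode_like_a_browser(b"\x95", "ISO-8859-1") == "\u2022")   # True
# A page claiming ISO-8859-15 is trusted, so 0xA4 is the Euro sign.
print(decode_like_a_browser(b"\xa4", "ISO-8859-15") == "\u20ac")  # True
```

Leaving ISO-8859-15 out of the override table is deliberate: its Euro sign 
sits at 0xA4, whereas CP-1252 puts it at 0x80, so the same trick would 
corrupt correctly labelled pages.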


--
Robin Berjon - http://berjon.com/
Feel like hiring me? Go to http://robineko.com/







Re: Encode::Mangled?

2009-05-29 Thread Dave Hodgkinson

On 29 May 2009, at 09:55, Richard Huxton wrote:

I'm dealing with data from a web-page that claims to be ISO-8859-1  
but actually has some Win-1252 embedded in it. I can convert it to  
UTF-8 and all seems well, however the characters need mapping. It's  
straightforward enough to handle the dozen or so chars I know about  
but I can't believe there isn't something on cpan for this.


ZapCP1252. Just nukes all aberrant characters. Thanks to Joel for  
pointing that out for me.


--
Dave Hodgkinson                      MSN: daveh...@hotmail.com
Site: http://www.davehodgkinson.com  UK: +44 7768 490620
Blog: http://davehodg.blogspot.com
Photos: http://www.flickr.com/photos/davehodg









Re: Encode::Mangled?

2009-05-29 Thread Richard Huxton

Dave Hodgkinson wrote:

ZapCP1252. Just nukes all aberrant characters. Thanks to Joel for pointing
that out for me.


That's what I was after. Ta very much.

http://search.cpan.org/~dwheeler/Encode-ZapCP1252-0.12/

--
  Richard Huxton
  Archonet Ltd


Re: Encode::Mangled?

2009-05-29 Thread Ben Evans

Richard Huxton wrote:
I'm dealing with data from a web-page that claims to be ISO-8859-1 but 
actually has some Win-1252 embedded in it. I can convert it to UTF-8 
and all seems well, however the characters need mapping. It's 
straightforward enough to handle the dozen or so chars I know about 
but I can't believe there isn't something on cpan for this.


Now the *correct* solution is to track down the people responsible for 
this travesty and beat them with sticks. Failing that, are people just 
rolling their own three-line function each time?


Sticks.

I've heard the standard management argument that "it'll take longer to 
fix it upstream and cost more than working around it, and anyway the 
broken data source will be going away real soon now..." more times than 
I care to think about.


Not only has it never been correct, it has never been within 1 order of 
magnitude of being correct. Sadly, the bleed and wastage that these 
types of idiocies incur is not something which is easily separately 
tracked - it just falls into the noise of "general development entropy".


So push back hard, and get the damn thing fixed upstream, where it 
should be done. If the managers ultimately refuse then use Dave's 
solution and just aggressively trim errant crap out of the feed - and 
include clear documentation as comments in your code as to what you're 
doing and why - that way if people whinge you (or the next guy) know 
where to point them.


Ben



Re: Encode::Mangled?

2009-05-29 Thread Richard Huxton

Robin Berjon wrote:
If the page claims to be in ISO-8859-15 then the chances are that 
whoever is sending it to you knows what they're doing, and you can just 
use the real thing.


Or am I missing something?


That's exactly it. The pages in question are claiming 8859-1, but 
they're not (well, not wholly). Presumably someone pastes content in 
from an MS-Word document and it contains bullet-points with invalid 
code-points. Your web-browser copes fine, of course.


Now I could just convert from win-1252 every time the page claims 
8859-1. That's not going to work for 8859-15 where you might have a Euro 
char on the page that's in a different code-point in win-1252.


So - what I've got at the moment is an ugly* tr/// to map the 20 or so 
chars in question. However, Dave H's suggestion looks like it might do 
the trick in a more transparent way.


* It's not the tr/// that's the problem, it's the fact that you need 
eight lines of documentation to explain it, and if I've got a typo 
somewhere in the hex-codes I'll probably never notice, which means 
writing test cases which means...
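
One way round the typo worry, sketched here in Python rather than as a 
Perl tr///, is to derive the mapping from the codec instead of typing the 
hex codes by hand. This assumes the text has already been decoded as 
ISO-8859-1, so the stray Win-1252 bytes are sitting in the string as C1 
control characters. `build_c1_fixups` and `fix_c1` are hypothetical names 
for the sketch.

```python
def build_c1_fixups():
    """Map each C1 code point (0x80-0x9F) to its CP-1252 character."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and
            # 0x9D undefined; those code points are left untouched.
            pass
    return table

C1_FIXUPS = build_c1_fixups()

def fix_c1(text):
    """Replace mis-decoded C1 characters with what CP-1252 meant."""
    return text.translate(C1_FIXUPS)

print(len(C1_FIXUPS))                                  # 27
print(fix_c1("one \x95 two") == "one \u2022 two")      # True
```

The table still wants a test or two, but there are no hand-typed hex 
codes left to get wrong.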


--
  Richard Huxton
  Archonet Ltd


Re: Encode::Mangled?

2009-05-29 Thread Richard Huxton

Ben Evans wrote:


So push back hard, and get the damn thing fixed upstream, where it 
should be done.


Unfortunately, only a time-traveller can fix it upstream. All I can do 
is make sure it goes no further.


If only we could get Joanna Lumley interested in the shoddy quality of 
much of the internet's character encodings...


--
  Richard Huxton
  Archonet Ltd


Re: Anniversary Beer

2009-05-29 Thread Joel Bernstein

Did anything happen about this? Is anniversary beer still on?
I'm guessing with June social just days away it's not happening /this/  
month, but did enough people get involved to make it happen at all?


/joel

On 1 May 2009, at 08:50, James Laver wrote:


We've almost got enough to make it worth the brewery's while.

Anyone interested who hasn't yet waved their hand?

(All the people who bought what went unclaimed at the social last  
year, perhaps?)


--James

On 20 Apr 2009, at 22:30, James Laver wrote:

As some of you will remember, last year I organised commemorative  
beer for our 10th anniversary (go london.pm!)


Yesterday, a london.pm member twittered about how much he enjoyed  
drinking some of it in the sunshine. This evening, another member  
IRCd about how much he enjoyed it. And then other people have said  
they really enjoyed it and wish they had some left.


So I offered to reorder. This year we'll order it early to make the  
most of the summer.


We need 120 bottles to make it worth their while and they'll  
deliver to the june social (which will fall on the 4th and hasn't  
yet been announced -- it is quite a way away).


I've got 42 bottles confirmed so far.

If you want in, please message me offlist. Please also make sure  
that you will be there to pick up the beer or make arrangements for  
someone else to.


I'll put a 3 week cap on orders, so don't delay!

Cheers,
--James







Italian Perl Workshop 2009

2009-05-29 Thread Stefano Rodighiero
Hello,

Perl.It (Italian Perl users) and Pisa.pm (Pisa Perl user group), in
cooperation with IIT-CNR and with the patronage of Comune di Pisa, are
organizing the 5th edition of Italian Perl Workshop (IPW 2009).

The conference will be held in Pisa, at the "Area di Ricerca"
(Research Area) of the CNR (National Research Centre) on 22 and 23
October 2009.

This is a non-profit event and is the national conference on the Perl
language and related technologies.

This Workshop aims to be an opportunity for Perl users, both
professionals and amateurs, to meet each other. It's also a great
occasion for people who don't know about Perl but want to learn more
about its features, its culture and the community around it.

The event is free.

More info: http://conferences.yapceurope.org/ipw2009/
To register: http://conferences.yapceurope.org/ipw2009/register

s.

-- 
www.stefanorodighiero.net


Re: Anniversary Beer

2009-05-29 Thread James Laver

On 29 May 2009, at 14:34, Joel Bernstein wrote:


Did anything happen about this? Is anniversary beer still on?
I'm guessing with June social just days away it's not happening /this/  
month, but did enough people get involved to make it happen at all?


/joel


We hit critical mass, I just didn't get around to arguing.

July.

And thanks for giving me a boot up the arse to call the brewery.

--James


Re: Italian Perl Workshop 2009

2009-05-29 Thread Hakim Cassimally
2009/5/29 Stefano Rodighiero :
 5th edition of Italian Perl Workshop (IPW 2009).
>
> The conference will be held in Pisa, at the "Area di Ricerca"
> (Research Area) of the CNR (National Research Centre) on 22 and 23
> October 2009.

Pisa is easy to get to from the "London" airports (Gatwick, Stansted,
Luton hahahahaha) and a number of notLondon airports too - Brum and
Liverpool for example.

Last year's English track included talks from mst, Tim Bunce, rgs,
Marcus Ramberg (and me ;-) so I hope some London.pmers get around to
submitting something.

And, to keep this on topic, as mst said of Pisa [1]:

  Don't underestimate the double malt beers either :)

osfameron

[1]: http://www.perl.it/blog/archives/000614.html -- scroll down a bit
for the quotes in English


Re: Italian Perl Workshop 2009

2009-05-29 Thread Nicholas Clark
On Fri, May 29, 2009 at 03:43:13PM +0100, Hakim Cassimally wrote:
> 2009/5/29 Stefano Rodighiero :
>  5th edition of Italian Perl Workshop (IPW 2009).
> >
> > The conference will be held in Pisa, at the "Area di Ricerca"
> > (Research Area) of the CNR (National Research Centre) on 22 and 23
> > October 2009.
> 
> Pisa is easy to get to from the "London" airports (Gatwick, Stansted,
> Luton hahahahaha) and a number of notLondon airports too - Brum and

I think you mean Gatwick, Luton and "Cambridge South".
Luton is nearer than Stansted, and is no harder to get to. (If not easier,
for example the trains run 24 hours a day to it.)

Nicholas Clark


"London" airports Re: Italian Perl Workshop 2009

2009-05-29 Thread Hakim Cassimally
2009/5/29 Nicholas Clark :
> On Fri, May 29, 2009 at 03:43:13PM +0100, Hakim Cassimally wrote:
>> 2009/5/29 Stefano Rodighiero :
>>  5th edition of Italian Perl Workshop (IPW 2009).
>
>> Pisa is easy to get to from the "London" airports (Gatwick, Stansted,
>> Luton hahahahaha) and a number of notLondon airports too - Brum and
>
> I think you mean Gatwick, Luton and "Cambridge South".
> Luton is nearer than Stansted, and is no harder to get to. (If not easier,
> for example the trains run 24 hours a day to it.)

Ah!  To be honest, the "hahahaha" was because I used to live near
Luton, and the "London" name for it still makes me chuckle.

I do remember us driving to Stansted airport back in the day, mainly me
and my little brother whining at our dad when he got lost somewhere in
the twisty roads around Much Hadham and Little Hadham...

osfameron


Re: "London" airports Re: Italian Perl Workshop 2009

2009-05-29 Thread Roger Burton West
On Fri, May 29, 2009 at 04:09:54PM +0100, Hakim Cassimally wrote:

>Ah!  To be honest, the "hahahaha" was because I used to live near
>Luton, and the "London" name for it still makes me chuckle.

Not quite as silly as London Southend, though.

R


Re: Italian Perl Workshop 2009

2009-05-29 Thread Dave Hodgkinson


On 29 May 2009, at 15:52, Nicholas Clark wrote:


On Fri, May 29, 2009 at 03:43:13PM +0100, Hakim Cassimally wrote:

2009/5/29 Stefano Rodighiero :
 5th edition of Italian Perl Workshop (IPW 2009).


The conference will be held in Pisa, at the "Area di Ricerca"
(Research Area) of the CNR (National Research Centre) on 22 and 23
October 2009.


Pisa is easy to get to from the "London" airports (Gatwick, Stansted,
Luton hahahahaha) and a number of notLondon airports too - Brum and


I think you mean Gatwick, Luton and "Cambridge South".
Luton is nearer than Stansted, and is no harder to get to. (If not easier,
for example the trains run 24 hours a day to it.)



The Stansted coach links are pretty impressive. Every few minutes  
during the day and not so bad at oh-fuck-hundred.

--
Dave Hodgkinson                      MSN: daveh...@hotmail.com
Site: http://www.davehodgkinson.com  UK: +44 7768 490620
Blog: http://davehodg.blogspot.com
Photos: http://www.flickr.com/photos/davehodg









Re: "London" airports Re: Italian Perl Workshop 2009

2009-05-29 Thread Nicholas Clark
On Fri, May 29, 2009 at 04:17:22PM +0100, Roger Burton West wrote:
> On Fri, May 29, 2009 at 04:09:54PM +0100, Hakim Cassimally wrote:
> 
> >Ah!  To be honest, the "hahahaha" was because I used to live near
> >Luton, and the "London" name for it still makes me chuckle.
> 
> Not quite as silly as London Southend, though.

How does the public transport link compare with "London Ashford Airport"?
http://www.lydd-airport.co.uk/

Google doesn't even offer "public transport" so I had to opt for walking:

http://maps.google.com/maps?f=d&source=s_d&saddr=TN29+9QL&daddr=13+Eyre+Street+Hill,+Clerkenwell,+London,+EC1R+5E&hl=en&mra=ls&dirflg=w&sll=51.239566,0.411987&sspn=1.105653,2.189026&ie=UTF8&ll=51.237847,0.411987&spn=1.105694,2.189026&t=h&z=9

Make sure you land at least 23 hours and 5 minutes before the June social.

Nicholas Clark



Re: Italian Perl Workshop 2009

2009-05-29 Thread Dominic Thoreau
2009/5/29 Hakim Cassimally :
>
> Pisa is easy to get to from the "London" airports (Gatwick, Stansted,
> Luton hahahahaha) and a number of notLondon airports too - Brum and
> Liverpool for example.

I did see an advert a few years ago by some exceedingly cheap airline
(whose name I forget) trying to claim they flew into London
Southend.
-- 
Better to remain silent and be thought a fool than to speak out and
remove all doubt.
-- Abraham Lincoln