Freeing HTTP::Message from HTML::Parser dependency

Christopher J. Madsen Mon, 16 Jan 2012 15:56:35 -0800

I stumbled across this bug:

  https://rt.cpan.org/Ticket/Display.html?id=66313


and a discussion here about removing HTTP::Message's dependency on
HTML::Parser (which needs a C compiler) for charset sniffing.

As it happens, I'm about to release a new dist that implements the HTML5
encoding sniffing algorithm in pure Perl with no non-core dependencies
for 5.8+.  While its primary function is to make it dead simple to open
an HTML file and get the right encoding layer applied automatically, it
also exposes the underlying mechanism.

My repo is https://github.com/madsen/io-html but since it's built with
dzil, I also made a gist of the processed module to make it easier to
read the docs: https://gist.github.com/1623654

I took a quick look at HTTP::Message, and I think you'd just need to do

    elsif ($self->content_is_html) {
        require IO::HTML;
        my $charset = IO::HTML::find_charset_in($$cref);
        return $charset if $charset;
    }

You're already doing the BOM and valid-UTF8 checks; all you need is the
<meta> check, which is what find_charset_in does.
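
To give a rough idea of what a <meta> check involves, here is a
deliberately naive sketch.  It is NOT the HTML5 prescan algorithm and
NOT how IO::HTML does it; naive_meta_charset is a hypothetical name
invented for this illustration, and the 1024-byte window is an
assumption.  It only looks for a charset declaration inside a <meta>
tag near the start of the document.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only.  A regex scan like this
# ignores comments, scripts, and most of the rules in the HTML5 spec.
sub naive_meta_charset {
    my ($html) = @_;

    # Only inspect the beginning of the document (assumed window size).
    my $head = substr($html, 0, 1024);

    while ($head =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;

        # Catches <meta charset="..."> as well as the charset=... part of
        # <meta http-equiv="Content-Type" content="text/html; charset=...">
        if ($attrs =~ /\bcharset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)/i) {
            return lc $1;
        }
    }

    return;    # no declaration found
}

print naive_meta_charset('<html><head><meta charset="UTF-8"></head>'), "\n";
```

The returned label is whatever the page declared, lowercased; nothing
here validates it or maps it to an encoding Perl knows about, which is
part of what a real implementation has to handle.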

One possible issue is that find_charset_in returns Perl's canonical name
for the encoding, which is not necessarily the same as the MIME or IANA
charset name.  You could do

  return Encode::find_encoding($charset)->mime_name if $charset;

if you want.
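
A slightly more defensive version of that one-liner might look like the
sketch below.  The helper name mime_charset is hypothetical (it is not
part of Encode, IO::HTML, or HTTP::Message), and it assumes a version of
Encode that provides the mime_name method.  It falls back to Perl's
canonical name when no MIME name is known, and returns nothing when the
encoding cannot be found at all.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: map an encoding name to its MIME charset name.
sub mime_charset {
    my ($name) = @_;
    return unless defined $name;

    # find_encoding returns undef for names Encode does not recognize.
    my $enc = Encode::find_encoding($name) or return;

    # mime_name may be undef for encodings without a registered MIME
    # name, so fall back to Perl's canonical name in that case.
    return $enc->mime_name || $enc->name;
}

print mime_charset('iso-8859-1'), "\n";
```

In the HTTP::Message branch above, the return line could then become
"return mime_charset($charset) if $charset;", assuming such a helper
were in scope.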

I'm planning to release this in a week or so, after I see if any more
bugs pop up or I think of any API changes I should make.

-- 
Chris Madsen                                          p...@cjmweb.net
  --------------------  http://www.cjmweb.net  --------------------
