Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Christopher J. Madsen
On 1/16/2012 9:52 PM, Bjoern Hoehrmann wrote:
> * Christopher J. Madsen wrote:
>>> Your UTF-8 validation code seems wrong to me, you consider the sequence
>>> F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
>>> the chart in .
>>
>> I guess the RE could be improved, but I'm not sure it's worth the effort
>> and added complication to catch a tiny fraction of false positives.
> 
> Why make the check at all if you don't care if it's right?

I can't use a simple utf8::decode check, because I read a fixed number
of bytes, and that might have cut a multi-byte character in half.  So I
use Encode::FB_QUIET, and then check the leftovers to make sure that
it's a single, plausible, partial UTF-8 character.  I have to check the
leftovers, or the whole test would be meaningless.  I just make sure
it's a start byte followed by an appropriate number of continuation bytes.

As you say, certain start bytes can't validly be followed by certain
continuation bytes, but writing an RE for those rules is more complexity
than I think the problem warrants.  What are the odds that I had 1021
bytes of valid UTF-8 (including at least 1 multi-byte character)
followed by bytes that match my current RE but a strict test could have
rejected?  I'm already just assuming that the next bytes would be
additional continuation bytes.

>>> Anyway, if people think this is the way to go, maybe HTTP::Message can
>>> adopt the Content-Type header charset extraction tests in HTML::Encoding
>>> so they don't get lost as my module becomes redundant?
>>
>> I thought it already did that?
> 
> Not as far as I can tell; links welcome though.

At the beginning of content_charset, it calls content_type_charset
(which is actually a HTTP::Headers method).

Or were you talking about t/01http.t and its associated input files?

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net  



Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Bjoern Hoehrmann
* Christopher J. Madsen wrote:
>Dropping support for UTF-32 from HTTP::Message is a separate issue from
>removing HTML::Parser.  I've got no comment on that.

(It's not quite as black and white as that, "HTML5" could be exempted in
the algorithm, for instance.)

>> Your UTF-8 validation code seems wrong to me, you consider the sequence
>> F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
>> the chart in .
>
>I guess the RE could be improved, but I'm not sure it's worth the effort
>and added complication to catch a tiny fraction of false positives.

Why make the check at all if you don't care if it's right?

>> Anyway, if people think this is the way to go, maybe HTTP::Message can
>> adopt the Content-Type header charset extraction tests in HTML::Encoding
>> so they don't get lost as my module becomes redundant?
>
>I thought it already did that?

Not as far as I can tell; links welcome though.
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Christopher J. Madsen
On 1/16/2012 6:53 PM, Bjoern Hoehrmann wrote:
> * Christopher J. Madsen wrote:
>> My repo is https://github.com/madsen/io-html but since it's built with
>> dzil, I also made a gist of the processed module to make it easier to
>> read the docs: https://gist.github.com/1623654

> It is not clear to me that the combination would actually conform to the
> "HTML5" proposal, for instance, HTTP::Message seems to recognize UTF-32
> BOMs, but as I recall the "HTML5" proposal does not allow that.

Dropping support for UTF-32 from HTTP::Message is a separate issue from
removing HTML::Parser.  I've got no comment on that.

> Your UTF-8 validation code seems wrong to me, you consider the sequence
> F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
> the chart in .

I guess the RE could be improved, but I'm not sure it's worth the effort
and added complication to catch a tiny fraction of false positives.

> Anyway, if people think this is the way to go, maybe HTTP::Message can
> adopt the Content-Type header charset extraction tests in HTML::Encoding
> so they don't get lost as my module becomes redundant?

I thought it already did that?

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net  



Re: Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Bjoern Hoehrmann
* Christopher J. Madsen wrote:
>My repo is https://github.com/madsen/io-html but since it's built with
>dzil, I also made a gist of the processed module to make it easier to
>read the docs: https://gist.github.com/1623654
>
>I took a quick look at HTTP::Message, and I think you'd just need to do
>
>elsif ($self->content_is_html) {
>   require IO::HTML;
>   my $charset = IO::HTML::find_charset_in($$cref);
>   return $charset if $charset;
>}
>
>You're already doing the BOM and valid-UTF8 checks; all you need is the
> check, which is what find_charset_in does.

It is not clear to me that the combination would actually conform to the
"HTML5" proposal, for instance, HTTP::Message seems to recognize UTF-32
BOMs, but as I recall the "HTML5" proposal does not allow that.

Your UTF-8 validation code seems wrong to me, you consider the sequence
F0 80 to be incomplete, but it's actually invalid, same for ED 80, see
the chart in .

Anyway, if people think this is the way to go, maybe HTTP::Message can
adopt the Content-Type header charset extraction tests in HTML::Encoding
so they don't get lost as my module becomes redundant?
-- 
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 


Freeing HTTP::Message from HTML::Parser dependency

2012-01-16 Thread Christopher J. Madsen
I stumbled across this bug:

  https://rt.cpan.org/Ticket/Display.html?id=66313

and a discussion here about removing HTTP::Message's dependency on
HTML::Parser (which needs a C compiler) for charset sniffing.

As it happens, I'm about to release a new dist that implements the HTML5
encoding sniffing algorithm in pure-Perl with no non-core dependencies
for 5.8+.  While its primary function is to make it dead simple to open
a HTML file and get the right encoding layer applied automatically, it
also exposes the underlying mechanism.

My repo is https://github.com/madsen/io-html but since it's built with
dzil, I also made a gist of the processed module to make it easier to
read the docs: https://gist.github.com/1623654

I took a quick look at HTTP::Message, and I think you'd just need to do

elsif ($self->content_is_html) {
require IO::HTML;
my $charset = IO::HTML::find_charset_in($$cref);
return $charset if $charset;
}

You're already doing the BOM and valid-UTF8 checks; all you need is the
 check, which is what find_charset_in does.

One possible issue is that find_charset_in returns Perl's canonical name
for the encoding, which is not necessarily the same as the  MIME or IANA
charset name.  You could do

  return Encode::find_encoding($charset)->mime_name if $charset;

if you want.

I'm planning to release this in a week or so, after I see if any more
bugs pop up or I think of any API changes I should make.

-- 
Chris Madsen  p...@cjmweb.net
    http://www.cjmweb.net