Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

Gregory Williams Sun, 06 Aug 2017 18:22:12 -0700

On Sat, 5 Aug 2017 12:16:04 -0400 gregor herrmann <gre...@debian.org> wrote:
> What helps is:
> - replace in lib/HTML/HTML5/Parser.pm
>   $response->{decoded_content} with $response->{content}
>   which feels a bit dangerous
> - or in lib/HTML/HTML5/Parser/UA.pm's get:
>   move the
>   if ($uri =~ /^file:/i)
>   up so it's the first alternative and then _get_fs is used
> 
> 
> The latter change would be, as a diff:
> 
> #v+
> --- a/lib/HTML/HTML5/Parser/UA.pm
> +++ b/lib/HTML/HTML5/Parser/UA.pm
> @@ -18,14 +18,14 @@ sub get
>  {
>         my ($class, $uri, $ua) = @_;
> 
> +       if ($uri =~ /^file:/i)
> +               { goto \&_get_fs }
>         if (ref $ua and $ua->isa('HTTP::Tiny') and $uri =~ /^https?:/i)
>                 { goto \&_get_tiny }
>         if (ref $ua and $ua->isa('LWP::UserAgent'))
>                 { goto \&_get_lwp }
>         if (UNIVERSAL::can('LWP::UserAgent', 'can') and not $NO_LWP)
>                 { goto \&_get_lwp }
> -       if ($uri =~ /^file:/i)
> -               { goto \&_get_fs }
> 
> 
> 
> While this helps for reading local files, I guess the _get_lwp() case
> might still be buggy.



I also looked into this and found another possible fix:

diff -ru HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm
--- HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm    2013-07-08 07:12:25.000000000 -0700
+++ HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm    2017-08-06 12:42:58.000000000 -0700
@@ -13,6 +13,7 @@
 use HTML::HTML5::Parser::TagSoupParser;
 use Scalar::Util qw(blessed);
 use URI::file;
+use Encode qw(encode_utf8);
 use XML::LibXML;
 
 BEGIN {
@@ -102,6 +103,11 @@
        {
         # XXX AGAIN DO THIS TO STOP ENORMOUS MEMORY LEAKS
         my ($errh, $errors) = @{$self}{qw(error_handler errors)};
+        
+        if (utf8::is_utf8($text)) {
+               $text   = encode_utf8($text);
+        }
+        
                $self->{parser}->parse_byte_string(
             $opts->{'encoding'}, $text, $dom,
             sub {


Part of the underlying issue here is that many variables and methods in these 
modules are named confusingly: they expect or require encoded bytes, but their 
names suggest decoded strings.
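
To make the bytes-versus-characters distinction concrete, here is a minimal
sketch of the same guard the patch adds, pulled out into a standalone helper.
The name as_utf8_bytes and the sample string are assumptions made for
illustration; neither is part of HTML::HTML5::Parser.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper, NOT part of HTML::HTML5::Parser: return UTF-8 bytes
# whether the caller handed us a decoded character string or encoded bytes.
sub as_utf8_bytes {
    my ($text) = @_;
    # utf8::is_utf8 reports Perl's internal UTF8 flag; the patch uses it as
    # the signal that $text is a decoded character string.
    return utf8::is_utf8($text) ? encode_utf8($text) : $text;
}

my $bytes = "caf\xC3\xA9";        # "café" as 5 UTF-8 encoded bytes
my $chars = decode_utf8($bytes);  # "café" as 4 decoded characters

my $from_chars = as_utf8_bytes($chars);  # encoded by the helper
my $from_bytes = as_utf8_bytes($bytes);  # passed through unchanged

printf "chars: %d, bytes: %d\n", length($chars), length($from_chars);
```

One caveat with this approach: the flag describes Perl's internal storage
rather than the caller's intent, so a Latin-1 character string that happens to
be stored without the flag would be passed through as if it were bytes.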

The patch to Parser.pm above should handle the LWP case, which the previously 
suggested patch avoids. It still passes the test suite (which should probably 
be improved to verify this case), and also supports the test case detailed in 
this bug report. I should mention, though, that I believe the test script 
included by Vincent Lefevre contains a double-encoding bug: $doc->toString() 
actually returns UTF-8 encoded bytes, which the :encoding(UTF-8) PerlIO layer 
on stdout will attempt to encode a second time.
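
The double-encoding effect can be reproduced without the parser at all. The
sketch below is an assumption-laden stand-in, not the test script from the bug
report: it uses a hand-built byte string in place of $doc->toString(), and an
in-memory filehandle in place of stdout, so the bytes written can be inspected.
The helper name written_bytes is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Stand-in for serializer output: U+00E9 already encoded as UTF-8 bytes.
my $bytes = encode_utf8("\x{e9}");    # two bytes: C3 A9

# Hypothetical helper: print $data through an :encoding(UTF-8) layer into an
# in-memory buffer and return the raw bytes that were written.
sub written_bytes {
    my ($data) = @_;
    my $buf = '';
    open(my $fh, '>:encoding(UTF-8)', \$buf) or die "open: $!";
    print {$fh} $data;
    close($fh) or die "close: $!";
    return $buf;
}

# Encoded bytes sent through the layer are treated as Latin-1 characters and
# encoded again; decoding them first yields the intended output.
my $double = written_bytes($bytes);
my $single = written_bytes(decode_utf8($bytes));

my $hex = sub { join ' ', map { sprintf '%02X', ord } split //, $_[0] };
print "double: ", $hex->($double), "\n";   # double: C3 83 C2 A9
print "single: ", $hex->($single), "\n";   # single: C3 A9
```

Under these assumptions, a script can avoid the problem either by decoding the
serialized bytes before printing, or by printing them to a handle that has no
encoding layer.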

thanks,
.greg
