Re: [xml] [PATCH] less-than character and HTML parser module

Daniel Veillard Thu, 16 Apr 2015 01:33:13 -0700

On Tue, Apr 14, 2015 at 05:43:42PM +0200, Christian Schoenebeck wrote:
> On Tuesday 14 April 2015 17:50:51 you wrote:
> > If anything like this does get put in, it should only be if it is a
> > configurable option that is disabled by default - an xml parser should
> > only accept a strictly-conforming document by default. Adding support for
> > ‘broken’ html because other (weak) parsers allow it is not a good plan as
> > it causes divergence from the standard.
> 
> There you go; you find the updated patch attached. It now requires 
> HTML_PARSE_RECOVER option to be set for recovering from stand-alone less-than 
> characters.


That sounds fine, *except* that it doesn't raise an error.
The parser knows this is a broken construct, and that must be pointed out.

thinkpad2:~/XML -> ./xmllint --html tst.html
tst.html:3: HTML parser error : htmlParseStartTag: invalid element name
<p> blah < booh </p>
          ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p> blah 
</p>
</body>
</html>
thinkpad2:~/XML -> ./xmllint --html --recover tst.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<body>
<p> blah &lt; booh </p>
</body>
</html>
thinkpad2:~/XML -> 

 The fact that we worked around a broken start tag construct must be reported.
Whether we do that with the recovery option or not is less important IMHO.

 It sounds a bit weird to handle that error case as one of the main content
cases. I would still be tempted to go into htmlParseStartTag, get the
error reported, and push corrective data instead when in recover mode.
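
 To make that concrete, here is a minimal standalone sketch of the
"always report, only repair in recover mode" behaviour. It is not libxml2
code and not a proposed patch: toy_is_stray_lt and toy_parse_text are
hypothetical names, and the output buffer handling is an assumption made
for illustration. Only the stray '<' test (followed by a blank or '=')
is taken from the v2 patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same test as the v2 patch: a '<' followed by a blank or '=' cannot
 * start a tag. Hypothetical helper, not part of libxml2. */
static int toy_is_stray_lt(const char *p) {
    return p[0] == '<' &&
           (p[1] == ' ' || p[1] == '\t' || p[1] == '\n' ||
            p[1] == '\r' || p[1] == '=');
}

/* Scan one text run from 'in' into 'out' (at most outsz bytes, always
 * NUL-terminated). A stray '<' is counted in *errors in both modes; only
 * in recover mode is it kept, escaped as "&lt;". Returns the number of
 * input bytes consumed. Hypothetical helper, not part of libxml2. */
static size_t toy_parse_text(const char *in, int recover,
                             char *out, size_t outsz, int *errors) {
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL && errors != NULL && outsz > 0);
    while (in[i] != '\0') {
        if (in[i] == '<') {
            if (!toy_is_stray_lt(in + i))
                break;              /* looks like a real tag: end of text run */
            (*errors)++;            /* report the broken construct in both modes */
            if (!recover)
                break;              /* strict mode: the text run stops here */
            if (n + 4 >= outsz)
                break;              /* no room left for "&lt;" plus the NUL */
            memcpy(out + n, "&lt;", 4);   /* recover mode: push corrective data */
            n += 4;
            i++;
            continue;
        }
        if (n + 1 >= outsz)
            break;                  /* output buffer full */
        out[n++] = in[i++];
    }
    out[n] = '\0';
    return i;
}
```

 Fed the text of the test case above, " blah < booh </p>", the sketch counts
one error in either mode. With recover set it copies the text up to the
closing tag and escapes the stray character; without it the copy stops in
front of the '<'.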

 Can we get a v3 ? :-)

  thanks

Daniel


> Best regards,
> Christian Schoenebeck

> diff -u libxml2-2.9.1+dfsg1.orig/HTMLparser.c libxml2-2.9.1+dfsg1/HTMLparser.c
> --- libxml2-2.9.1+dfsg1.orig/HTMLparser.c     2015-04-14 13:05:01.000000000 +0200
> +++ libxml2-2.9.1+dfsg1/HTMLparser.c  2015-04-14 18:22:41.143973776 +0200
> @@ -2948,8 +2948,10 @@
>  
>  
>  /**
> - * htmlParseCharData:
> + * htmlParseCharDataInternal:
>   * @ctxt:  an HTML parser context
> + * @prep:  optional character to be prepended to text, 0 if no character
> + *         shall be prepended
>   *
>   * parse a CharData section.
>   * if we are within a CDATA section ']]>' marks an end of section.
> @@ -2958,12 +2960,15 @@
>   */
>  
>  static void
> -htmlParseCharData(htmlParserCtxtPtr ctxt) {
> -    xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 5];
> +htmlParseCharDataInternal(htmlParserCtxtPtr ctxt, char prep) {
> +    xmlChar buf[HTML_PARSER_BIG_BUFFER_SIZE + 6];
>      int nbchar = 0;
>      int cur, l;
>      int chunk = 0;
>  
> +    if (prep)
> +     buf[nbchar++] = prep;
> +
>      SHRINK;
>      cur = CUR_CHAR(l);
>      while (((cur != '<') || (ctxt->token == '<')) &&
> @@ -3043,6 +3048,21 @@
>  }
>  
>  /**
> + * htmlParseCharData:
> + * @ctxt:  an HTML parser context
> + *
> + * parse a CharData section.
> + * if we are within a CDATA section ']]>' marks an end of section.
> + *
> + * [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
> + */
> +
> +static void
> +htmlParseCharData(htmlParserCtxtPtr ctxt) {
> +    htmlParseCharDataInternal(ctxt, 0);
> +}
> +
> +/**
>   * htmlParseExternalID:
>   * @ctxt:  an HTML parser context
>   * @publicID:  a xmlChar** receiving PubidLiteral
> @@ -4157,14 +4177,24 @@
>           }
>  
>           /*
> -          * Third case :  a sub-element.
> +          * Third case : (unescaped) stand-alone less-than character.
> +          *              Only if HTML_PARSE_RECOVER option is set.
> +          */
> +         else if (ctxt->recovery && (CUR == '<') &&
> +                  (IS_BLANK_CH(NXT(1)) || (NXT(1) == '='))) {
> +             NEXT;
> +             htmlParseCharDataInternal(ctxt, '<');
> +         }
> +
> +         /*
> +          * Fourth case :  a sub-element.
>            */
>           else if (CUR == '<') {
>               htmlParseElement(ctxt);
>           }
>  
>           /*
> -          * Fourth case : a reference. If if has not been resolved,
> +          * Fifth case : a reference. If if has not been resolved,
>            *    parsing returns it's Name, create the node
>            */
>           else if (CUR == '&') {
> @@ -4172,7 +4202,7 @@
>           }
>  
>           /*
> -          * Fifth case : end of the resource
> +          * Sixth case : end of the resource
>            */
>           else if (CUR == 0) {
>               htmlAutoCloseOnEnd(ctxt);
> @@ -4567,7 +4597,17 @@
>           }
>  
>           /*
> -          * Third case :  a sub-element.
> +          * Third case : (unescaped) stand-alone less-than character.
> +          *              Only if HTML_PARSE_RECOVER option is set.
> +          */
> +         else if (ctxt->recovery && (CUR == '<') &&
> +                  (IS_BLANK_CH(NXT(1)) || (NXT(1) == '='))) {
> +             NEXT;
> +             htmlParseCharDataInternal(ctxt, '<');
> +         }
> +
> +         /*
> +          * Fourth case :  a sub-element.
>            */
>           else if (CUR == '<') {
>               htmlParseElementInternal(ctxt);
> @@ -4578,7 +4618,7 @@
>           }
>  
>           /*
> -          * Fourth case : a reference. If if has not been resolved,
> +          * Fifth case : a reference. If if has not been resolved,
>            *    parsing returns it's Name, create the node
>            */
>           else if (CUR == '&') {
> @@ -4586,7 +4626,7 @@
>           }
>  
>           /*
> -          * Fifth case : end of the resource
> +          * Sixth case : end of the resource
>            */
>           else if (CUR == 0) {
>               htmlAutoCloseOnEnd(ctxt);

> _______________________________________________
> xml mailing list, project page  http://xmlsoft.org/
> [email protected]
> https://mail.gnome.org/mailman/listinfo/xml


-- 
Daniel Veillard      | Open Source and Standards, Red Hat
[email protected]  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/