Re: [python-committers] The evolution of HTMLParser

Ezio Melotti Wed, 20 Nov 2013 13:56:19 -0800

On Wed, Nov 20, 2013 at 10:34 PM, Antoine Pitrou <anto...@python.org> wrote:
> On mer., 2013-11-20 at 21:57 +0200, Ezio Melotti wrote:
>> Now I'm working on #13633 (Automatically convert character references
>> in HTMLParser [1]), and I'm planning to add a convert_charrefs boolean
>> flag to the constructors that, when set to True, will automatically
>> convert charrefs (e.g. "&quot;", "&#34;") to the corresponding Unicode
>> characters, and avoid calling the handle_charref/handle_entityref
>> methods.
>
> How about a separate StandardHTMLParser class that would have the right
> handle_charref / handle_entityref implementations?
> (you could also change other behaviours in that class if desired)
>


When convert_charrefs is True, handle_charref/handle_entityref are not
called at all.
This is in part because there's no easy way to tell where an invalid
charref ends (e.g. if the ';' is missing), so the parser would either
have to find only correctly terminated charrefs (which rules out
handling invalid HTML5 entities) or apply the HTML5 algorithm (or a
subset of it) to find what might be a charref, after which the user
would have to apply it again to find the corresponding character.
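
To illustrate why the end of a charref is hard to locate, here is a
small sketch using html.unescape() from the standard library (Python
3.4+), which follows the HTML5 rules. The sample strings and variable
names are made up for illustration.

```python
import html

# Unterminated references: the ';' is missing, yet HTML5 still
# recognizes them.
unterminated_named = html.unescape("a &gt b")
unterminated_numeric = html.unescape("say &#34hi&#34")

# "&notit;" is not a known entity, but its prefix "&not" is, so only
# the prefix is converted and the rest is kept as plain text.
prefix_match = html.unescape("&notit;")

print(unterminated_named)    # a > b
print(unterminated_numeric)  # say "hi"
print(prefix_match)          # ¬it;
```

A parser that only looked for "&name;" would miss the first two cases
and get the third one wrong.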

So, for example, passing "<p>foo&gt;bar</p>" to the parser currently results in:
  1) a call to handle_starttag with "p";
  2) a call to handle_data with "foo";
  3) a call to handle_entityref with "gt";
  4) a call to handle_data with "bar";
  5) a call to handle_endtag with "p".
The user then has to write code in handle_entityref to convert "gt"
to ">" and join "foo" + ">" + "bar" before getting the content of the
paragraph, i.e. "foo>bar".
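
A minimal sketch of that sequence, assuming the convert_charrefs
argument as it exists in Python 3.4 and later, set to False to get the
behavior described above. The ManualCharrefParser class and its
attributes are hypothetical names used only for this example.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class ManualCharrefParser(HTMLParser):
    """Hypothetical subclass: logs events and converts charrefs by hand."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.events = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))
        self.text.append(data)

    def handle_entityref(self, name):
        # name is e.g. "gt"; unknown names would raise KeyError here.
        self.events.append(("entityref", name))
        self.text.append(chr(name2codepoint[name]))

    def handle_charref(self, name):
        # name is e.g. "34" or "x22" (without the leading "&#").
        self.events.append(("charref", name))
        code = int(name[1:], 16) if name[0] in "xX" else int(name)
        self.text.append(chr(code))


manual = ManualCharrefParser()
manual.feed("<p>foo&gt;bar</p>")
manual.close()
manual_text = "".join(manual.text)

for event in manual.events:
    print(event)
print(manual_text)  # foo>bar
```

Note that this hand-written conversion only copes with well-formed,
known references; it does not implement the HTML5 rules for invalid
ones.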

With the proposed patch, the parser gets all the text between tags,
and then passes it to html.unescape() to convert all the charrefs
according to the HTML5 algorithm, so the example above becomes:
  1) a call to handle_starttag with "p";
  2) a call to handle_data with "foo>bar";
  3) a call to handle_endtag with "p";
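
The same input with conversion enabled can be sketched as follows,
again assuming the convert_charrefs argument as shipped in Python 3.4
and later. EventLogger is a hypothetical helper class, not part of the
patch.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical subclass that records which handlers get called."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Not expected to run when convert_charrefs is True.
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Not expected to run when convert_charrefs is True.
        self.events.append(("charref", name))


logger = EventLogger()
logger.feed("<p>foo&gt;bar</p>")
logger.close()

for event in logger.events:
    print(event)
```

The text between the tags arrives in a single handle_data call with
the reference already converted.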

This also happens in the core of HTMLParser, so in order to create a
subclass where charrefs are converted automatically and without the
handle_charref/handle_entityref calls, I would also have to duplicate
(or reorganize) a lot of code.

Also, while parsing arbitrary HTML you might or might not get charrefs,
so the only use cases left that I can think of are:
  * preserving the entities -- this can be done by setting
convert_charrefs=False and returning what gets passed to
handle_charref/handle_entityref or by using html.escape() after the
parsing (the output might be different though);
  * using a different set of charrefs -- xml and html4 are subsets of
the html5 charrefs so they are covered, for other sets it's still
possible to keep using convert_charrefs=False (and people will have
time till 3.5/3.6 to add it before the default changes);
(Note that unlike the strict argument/mode, I don't plan to remove
convert_charrefs -- only to make it default to True.)
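
As a rough sketch of the first use case (preserving the entities), a
subclass can set convert_charrefs=False and write back whatever the
handlers receive. This assumes the argument as it exists in Python 3.4
and later; EntityPreserver and the sample input are hypothetical.

```python
import html
from html.parser import HTMLParser


class EntityPreserver(HTMLParser):
    """Hypothetical subclass that keeps charrefs as written in the source."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        # Rebuild the named reference, e.g. "gt" -> "&gt;".
        self.parts.append("&" + name + ";")

    def handle_charref(self, name):
        # Rebuild the numeric reference, e.g. "34" -> "&#34;".
        self.parts.append("&#" + name + ";")


preserver = EntityPreserver()
preserver.feed("<p>foo&gt;bar &#34;x&#34;</p>")
preserver.close()
preserved = "".join(preserver.parts)

# The alternative: convert first, then escape again after parsing.
re_escaped = html.escape(html.unescape("foo&gt;bar &#34;x&#34;"))

print(preserved)   # foo&gt;bar &#34;x&#34;
print(re_escaped)  # foo&gt;bar &quot;x&quot;
```

The second result shows how the html.escape() route can differ from
the input: the numeric "&#34;" comes back as "&quot;".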

Best Regards,
Ezio Melotti

> Regards
>
> Antoine.
