On Wed, Nov 20, 2013 at 10:34 PM, Antoine Pitrou <anto...@python.org> wrote:
> On mer., 2013-11-20 at 21:57 +0200, Ezio Melotti wrote:
>> Now I'm working on #13633 (Automatically convert character references
>> in HTMLParser [1]), and I'm planning to add a convert_charrefs boolean
>> flag to the constructors that, when set to True, will automatically
>> convert charrefs (e.g. "&quot;", "&#34;") to the corresponding Unicode
>> characters, and avoid calling the handle_charref/handle_entityref
>> methods.
>
> How about a separate StandardHTMLParser class that would have the right
> handle_charref / handle_entityref implementations?
> (you could also change other behaviours in that class if desired)
>

When convert_charrefs is True, handle_charref/handle_entityref are not
called at all. This is in part because there's no easy way to tell
where an invalid charref ends (e.g. if the ';' is missing), so the
parser would either have to find only correctly terminated charrefs
(but that wouldn't allow handling invalid HTML5 entities) or it would
have to apply the HTML5 algorithm (or a subset of it) to find what
might be a charref, and then the user would have to do it again to
find the corresponding character.

So, for example, passing "<p>foo&gt;bar</p>" to the parser currently
results in:
  1) a call to handle_starttag with "p";
  2) a call to handle_data with "foo";
  3) a call to handle_entityref with "gt";
  4) a call to handle_data with "bar";
  5) a call to handle_endtag with "p".
The user then has to write the code in handle_entityref to convert
"gt" to ">" and then do "foo" + ">" + "bar" before getting the
content of the paragraph, i.e. "foo>bar".

With the proposed patch, the parser gets all the text between tags,
and then passes it to html.unescape() to convert all the charrefs
according to the HTML5 algorithm, so the example above becomes:
  1) a call to handle_starttag with "p";
  2) a call to handle_data with "foo>bar";
  3) a call to handle_endtag with "p".

This also happens in the core of HTMLParser, so in order to create a
subclass where charrefs are converted automatically and without the
handle_charref/handle_entityref calls, I would also have to duplicate
(or reorganize) a lot of code.
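
The two call sequences described above can be reproduced with a small event-recording subclass. This is only a sketch: it assumes a Python version where HTMLParser accepts the convert_charrefs argument (3.4 or later), and EventLogger and parse_events are hypothetical names chosen for illustration, not part of the patch.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every handler call as a (kind, value) tuple."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # Only reached when convert_charrefs is False; name is e.g. "gt".
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # Only reached when convert_charrefs is False; name is e.g. "34".
        self.events.append(("charref", name))


def parse_events(markup, convert_charrefs):
    """Feed markup to a fresh EventLogger and return the recorded events."""
    parser = EventLogger(convert_charrefs)
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.events


if __name__ == "__main__":
    markup = "<p>foo&gt;bar</p>"
    for flag in (False, True):
        print(flag, parse_events(markup, flag))
```

With the flag off, the text arrives in three pieces and the caller has to reassemble it; with the flag on, a single handle_data call carries "foo>bar".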

Also, while parsing arbitrary HTML you might or might not get charrefs,
so the only use cases left that I can think of are:
  * preserving the entities -- this can be done by setting
    convert_charrefs=False and returning what gets passed to
    handle_charref/handle_entityref, or by using html.escape() after
    the parsing (the output might be different though);
  * using a different set of charrefs -- xml and html4 are subsets of
    the html5 charrefs so they are covered; for other sets it's still
    possible to keep using convert_charrefs=False (and people will
    have time till 3.5/3.6 to add it before the default changes).

(Note that unlike the strict argument/mode, I don't plan to remove
convert_charrefs -- only to make it default to True.)

Best Regards,
Ezio Melotti

> Regards
>
> Antoine.
_______________________________________________
python-committers mailing list
python-committers@python.org
https://mail.python.org/mailman/listinfo/python-committers
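
For the "preserving the entities" use case mentioned above, a rough sketch of both options might look like the following. It assumes HTMLParser accepts convert_charrefs (Python 3.4 or later); EntityPreservingText, TextCollector, text_with_entities and text_reescaped are hypothetical names used only for illustration.

```python
import html
from html.parser import HTMLParser


class EntityPreservingText(HTMLParser):
    """Hypothetical helper: collects text and re-emits charrefs as references."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # name is e.g. "gt"; a missing ';' in the source is not preserved.
        self.chunks.append("&%s;" % name)

    def handle_charref(self, name):
        # name is e.g. "34" or "x22".
        self.chunks.append("&#%s;" % name)


class TextCollector(HTMLParser):
    """Hypothetical helper: collects already-converted text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_with_entities(markup):
    """Option 1: keep the references as they were passed to the handlers."""
    parser = EntityPreservingText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


def text_reescaped(markup):
    """Option 2: let the parser convert, then call html.escape() afterwards."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return html.escape("".join(parser.chunks))


if __name__ == "__main__":
    markup = "<p>foo&gt;bar &#34;x&#34;</p>"
    print(text_with_entities(markup))
    print(text_reescaped(markup))
```

The second option normalizes the spelling of the references (a numeric reference for a double quote comes back as a named one), which is one way the output might differ from the source.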