Re: intolerant HTML parser

John Nagle Sat, 06 Feb 2010 11:32:37 -0800

Jim wrote:

I generate some HTML and I want to include in my unit tests a check
for syntax.  So I am looking for a program that will complain at any
syntax irregularities.


I am familiar with Beautiful Soup (use it all the time) but it is
intended to cope with bad syntax.  I just tried feeding
HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
didn't complain.


   Try HTML5lib.

        http://code.google.com/p/html5lib/downloads/list

The syntax for HTML5 has well-defined notions of "correct",
"fixable", and "unparseable".  For example, the common but
incorrect form of HTML comments,

        <- comment ->

is understood.

HTML5lib is slow, though. Sometimes very slow. It's really a reference
implementation of the spec. There's code like this:


            #Should speed up this check somehow (e.g. move the set to a constant)
            if ((0x0001 <= charAsInt <= 0x0008) or
                (0x000E <= charAsInt <= 0x001F) or
                (0x007F  <= charAsInt <= 0x009F) or
                (0xFDD0  <= charAsInt <= 0xFDEF) or
                charAsInt in frozenset([0x000B, 0xFFFE, 0xFFFF, 0x1FFFE,
                                        0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
                                        0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
                                        0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE,
                                        0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE,
                                        0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE,
                                        0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                                        0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE,
                                        0xFFFFF, 0x10FFFE, 0x10FFFF])):
                self.tokenQueue.append({"type": tokenTypes["ParseError"],
                                        "data":
                                         "illegal-codepoint-for-numeric-entity",
                                        "datavars": {"charAsInt": charAsInt}})

Every time through the loop (once per character), they build that frozen
set again.
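
The fix that comment hints at could look something like the sketch below:
build the set once at import time and have the check refer to it. This is
only an illustration, not html5lib's actual code. The names
ILLEGAL_CODEPOINTS and is_illegal_codepoint are made up for the example, and
the set is generated from the 17 Unicode planes rather than typed out by
hand, which gives the same 35 values as the literal list above.

```python
# Sketch only: ILLEGAL_CODEPOINTS and is_illegal_codepoint are hypothetical
# names chosen for this example, not identifiers from html5lib.

# Built once, when the module is imported, instead of on every check.
# Contains 0x000B plus the last two code points (xFFFE, xFFFF) of each of
# the 17 Unicode planes (0x00 through 0x10).
ILLEGAL_CODEPOINTS = frozenset(
    [0x000B]
    + [(plane << 16) | low
       for plane in range(0x11)
       for low in (0xFFFE, 0xFFFF)]
)


def is_illegal_codepoint(char_as_int):
    """Return True if a numeric entity refers to a disallowed code point."""
    return (
        0x0001 <= char_as_int <= 0x0008
        or 0x000E <= char_as_int <= 0x001F
        or 0x007F <= char_as_int <= 0x009F
        or 0xFDD0 <= char_as_int <= 0xFDEF
        or char_as_int in ILLEGAL_CODEPOINTS
    )
```

The range tests are unchanged; only the membership test now hits a set that
already exists, so each call costs a hash lookup rather than a list build
plus a frozenset construction.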


                                John Nagle
--
http://mail.python.org/mailman/listinfo/python-list
