Jim wrote:
I generate some HTML and I want to include in my unit tests a check
for syntax. So I am looking for a program that will complain at any
syntax irregularities.
I am familiar with Beautiful Soup (use it all the time) but it is
intended to cope with bad syntax. I just tried feeding
HTMLParser.HTMLParser some HTML containing '<p>a<b>b</p></b>' and it
didn't complain.
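For what it's worth, a check for that particular kind of mis-nesting can be sketched on top of the standard-library parser itself. This is only a rough sketch, not a validator: the names `NestingChecker`, `check_nesting`, and `VOID_TAGS` are made up for illustration, the import assumes Python 3 (where the module is `html.parser`), and it is stricter than real HTML, which lets some end tags such as `</p>` or `</li>` be omitted.

```python
from html.parser import HTMLParser

# Elements that never take an end tag (assumed list of HTML void elements).
VOID_TAGS = frozenset([
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
])


class NestingChecker(HTMLParser):
    """Hypothetical helper: records end tags that do not match the open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.problems = []    # human-readable complaints

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.open_tags:
            self.problems.append("unexpected </%s>" % tag)
        elif self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # Leave the stack alone so the open tag is also reported later.
            self.problems.append(
                "found </%s> while <%s> is still open" % (tag, self.open_tags[-1])
            )


def check_nesting(markup):
    """Return a list of nesting problems; an empty list means none were found."""
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    for tag in checker.open_tags:
        checker.problems.append("<%s> was never closed" % tag)
    return checker.problems
```

In a unit test this could be used as `assert check_nesting(generated_html) == []`. It only looks at tag nesting, so it says nothing about attributes, entities, or which elements are allowed inside which.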
Try HTML5lib.
http://code.google.com/p/html5lib/downloads/list
The syntax for HTML5 has well-defined notions of "correct",
"fixable", and "unparseable". For example, the common but
incorrect form of HTML comments,
<- comment ->
is understood.
HTML5lib is slow, though. Sometimes very slow. It's really a reference
implementation of the spec. There's code like this:
    # Should speed up this check somehow (e.g. move the set to a constant)
    if ((0x0001 <= charAsInt <= 0x0008) or
        (0x000E <= charAsInt <= 0x001F) or
        (0x007F <= charAsInt <= 0x009F) or
        (0xFDD0 <= charAsInt <= 0xFDEF) or
        charAsInt in frozenset([0x000B, 0xFFFE, 0xFFFF, 0x1FFFE,
                                0x1FFFF, 0x2FFFE, 0x2FFFF, 0x3FFFE,
                                0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE,
                                0x5FFFF, 0x6FFFE, 0x6FFFF, 0x7FFFE,
                                0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE,
                                0x9FFFF, 0xAFFFE, 0xAFFFF, 0xBFFFE,
                                0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                                0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE,
                                0xFFFFF, 0x10FFFE, 0x10FFFF])):
        self.tokenQueue.append({"type": tokenTypes["ParseError"],
                                "data":
                                "illegal-codepoint-for-numeric-entity",
                                "datavars": {"charAsInt": charAsInt}})
Every time through the loop (once per character), they build that frozen
set again.
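The fix hinted at in that comment might look roughly like the sketch below: build the set once at import time and test against it. This is not html5lib's actual code. The names `ILLEGAL_RANGES`, `ILLEGAL_CODEPOINTS`, and `is_illegal_codepoint` are invented here, and the set is generated from the pattern in the literal list (0xFFFE and 0xFFFF in each of the 17 Unicode planes, plus 0x000B) instead of being typed out.

```python
# Inclusive ranges copied from the range tests in the quoted check.
ILLEGAL_RANGES = (
    (0x0001, 0x0008),
    (0x000E, 0x001F),
    (0x007F, 0x009F),
    (0xFDD0, 0xFDEF),
)

# Built once when the module is imported, not on every call.
# Covers 0x000B plus the last two code points of each plane
# (0xFFFE, 0xFFFF, 0x1FFFE, ..., 0x10FFFF): 35 values in total.
ILLEGAL_CODEPOINTS = frozenset(
    [0x000B]
    + [(plane << 16) | low
       for plane in range(17)
       for low in (0xFFFE, 0xFFFF)]
)


def is_illegal_codepoint(char_as_int):
    """Return True if the quoted check would report a parse error."""
    for low, high in ILLEGAL_RANGES:
        if low <= char_as_int <= high:
            return True
    return char_as_int in ILLEGAL_CODEPOINTS
```

With that in place the tokenizer's condition would shrink to something like `if is_illegal_codepoint(charAsInt):`, and the membership test stays a constant-time lookup with no set construction per call.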
John Nagle