On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> > Ron has already noted that the lxml and html5 parser do the right thing,
> > so just for the record:
> >
> > The HTML fragment above is well-formed and contains a number of li
> > elements at the same level directly below the ol element, not lots of
> > nested li elements. The end tag of the li element is optional (except in
> > XHTML) and li elements don't nest.
>
> That's correct. However, parsing it with html.parser and then
> reconstituting it as shown in the example code results in all the
> </li> tags coming up right before the </ol>, indicating that the <li>
> tags were parsed as deeply nested rather than as siblings.

Yes, I got that. What I wanted to say was that this is indeed a bug in
html.parser and not an error (or sloppiness, as you called it) in the
input or ambiguity in the HTML standard.

> In order to get a successful parse out of this, I need something which
> sees them as siblings,

Right, but Roel (correct name this time) had already posted that lxml
and html5lib parse this correctly, so I saw no need to belabour that
point.

> which html5lib seems to be doing fine. Whether
> it has other issues, I don't know, but I guess I'll find out....

The link somebody posted mentions that it's "very slow". Which may or
may not be a problem when you have to parse 9000 files. But if it does
implement HTML5 correctly, it should parse any file the same as a modern
browser does (maybe excluding quirks mode).

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at        |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
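
For illustration, here is a minimal stdlib-only sketch of the rule discussed above: an li start tag implicitly closes a still-open li in the same list, so the items end up as siblings. It is not how lxml or html5lib are implemented. html.parser.HTMLParser itself only reports tag events, so the sketch keeps its own stack of open elements. The names LiDepthParser, li_depths and FRAGMENT are made up for the example, and the fragment is a stand-in, not the one from the original post. Stopping the search at an enclosing ol or ul is a simplification of the HTML5 tree-construction rules.

```python
from html.parser import HTMLParser

# Stand-in fragment (an assumption, not the input from the thread):
# three list items, none of them with an explicit </li>.
FRAGMENT = "<ol><li>one<li>two<li>three</ol>"


class LiDepthParser(HTMLParser):
    """Hypothetical helper: record how many open <li> ancestors each <li> has."""

    def __init__(self, implied_end_tags=True):
        super().__init__()
        self.implied_end_tags = implied_end_tags
        self.stack = []   # names of the currently open elements
        self.depths = []  # one entry per <li> seen, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self.implied_end_tags:
                self._close_open_li()
            self.depths.append(self.stack.count("li"))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop up to and including the matching open element, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def _close_open_li(self):
        # Close the nearest open <li>, but stop at an enclosing list so that
        # a list nested inside an <li> keeps its parent item open.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] in ("ol", "ul"):
                return
            if self.stack[i] == "li":
                del self.stack[i:]
                return


def li_depths(fragment, implied_end_tags=True):
    """Return the <li> nesting depth of every <li> in the fragment."""
    parser = LiDepthParser(implied_end_tags)
    parser.feed(fragment)
    parser.close()
    return parser.depths


if __name__ == "__main__":
    print(li_depths(FRAGMENT))                          # [0, 0, 0]
    print(li_depths(FRAGMENT, implied_end_tags=False))  # [0, 1, 2]
```

With the implied end tag honoured, all three items sit at depth 0, i.e. they are siblings. With it ignored, each item nests inside the previous one, which corresponds to the reconstituted output where every </li> shows up right before the </ol>.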
-- https://mail.python.org/mailman/listinfo/python-list