On Mon, 24 Oct 2022 at 23:22, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
>
> On 2022-10-24 21:56:13 +1100, Chris Angelico wrote:
> > On Mon, 24 Oct 2022 at 21:33, Peter J. Holzer <hjp-pyt...@hjp.at> wrote:
> > > Ron has already noted that the lxml and html5 parser do the right thing,
> > > so just for the record:
> > >
> > > The HTML fragment above is well-formed and contains a number of li
> > > elements at the same level directly below the ol element, not lots of
> > > nested li elements. The end tag of the li element is optional (except in
> > > XHTML) and li elements don't nest.
> >
> > That's correct. However, parsing it with html.parser and then
> > reconstituting it as shown in the example code results in all the
> > </li> tags coming up right before the </ol>, indicating that the <li>
> > tags were parsed as deeply nested rather than as siblings.
>
> Yes, I got that. What I wanted to say was that this is indeed a bug in
> html.parser and not an error (or sloppyness, as you called it) in the
> input or ambiguity in the HTML standard.

I described the HTML as "sloppy" for a number of reasons, but I was of
the understanding that it's generally recommended to have the closing
tags. Not that it matters much.

> > which html5lib seems to be doing fine. Whether
> > it has other issues, I don't know, but I guess I'll find out....
>
> The link somebody posted mentions that it's "very slow". Which may or
> may not be a problem when you have to parse 9000 files. But if it does
> implement HTML5 correctly, it should parse any file the same as a modern
> browser does (maybe excluding quirks mode).
>

Yeah. TBH I think the two-hour run time is primarily dominated by
network delays, not parsing time, but if I had a service where people
could upload HTML to be parsed, that might affect throughput.

For the record, if anyone else is considering html5lib: It is likely
"fast enough", even if not fast. Give it a try.

(And I know what slow parsing feels like. Parsing a ~100MB file with a
decently-fast grammar-based lexer takes a good while. Parsing the same
content after it's been converted to JSON? Fast.)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
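
As a footnote to the nesting discussion above, here is a minimal sketch
of why omitted </li> tags can come out as deep nesting. It is not the
example code from the thread and not how any particular tree builder is
implemented. It uses only the standard library's html.parser.HTMLParser,
which reports start and end tags as events and leaves the tree to the
caller. The fragment, the DepthTracker class, and the li_depths helper
are all made up for illustration, and the "a new <li> closes an open
<li>" rule is a simplified version of what the HTML standard specifies.

```python
from html.parser import HTMLParser

# Hypothetical sample input: <li> elements with the optional end tags omitted.
FRAGMENT = "<ol><li>one<li>two<li>three</ol>"


class DepthTracker(HTMLParser):
    """Record the depth of the open-element stack at each <li> start tag.

    implied_end_tags=False: every element stays open until an explicit end
    tag arrives (the naive reading that yields nested <li> elements).
    implied_end_tags=True: a simplified form of the HTML rule is applied,
    closing an <li> that is on top of the stack when a new <li> starts.
    """

    def __init__(self, implied_end_tags):
        super().__init__()
        self.implied_end_tags = implied_end_tags
        self.open_elements = []   # stack of currently open tag names
        self.li_depths = []       # stack depth seen at each <li>

    def handle_starttag(self, tag, attrs):
        if (self.implied_end_tags and tag == "li"
                and self.open_elements and self.open_elements[-1] == "li"):
            self.open_elements.pop()          # implied </li>
        self.open_elements.append(tag)
        if tag == "li":
            self.li_depths.append(len(self.open_elements))

    def handle_endtag(self, tag):
        # Close everything back to the most recent matching open element.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass


def li_depths(fragment, implied_end_tags):
    tracker = DepthTracker(implied_end_tags)
    tracker.feed(fragment)
    tracker.close()
    return tracker.li_depths


print(li_depths(FRAGMENT, implied_end_tags=False))  # [2, 3, 4]
print(li_depths(FRAGMENT, implied_end_tags=True))   # [2, 2, 2]
```

The first result is the "deeply nested" reading, the second is the
sibling reading that lxml and html5lib produce for this kind of input.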