On 2022-10-24 13:29:13 +1100, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
> 
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> blob = b"""
> <OL>
> <LI>'THERE sinks the nebulous star we call the Sun,
> <LI>If that hypothesis of theirs be sound,'
[...]
> <LI>Stirring a sudden transport rose and fell.
> </OL>
> """
> soup = BeautifulSoup(blob, "html.parser")
> print(soup)
> 
> 
> On this small snippet, it works acceptably, but puts a large number of
> </li> tags immediately before the </ol>.

Ron has already noted that the lxml and html5lib parsers do the right thing,
so just for the record:

The HTML fragment above is well-formed and contains a number of li
elements at the same level directly below the ol element, not lots of
nested li elements. The end tag of the li element is optional (except in
XHTML) and li elements don't nest.
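
To illustrate the rule, here is a minimal sketch using only the standard
library's html.parser.HTMLParser. It is not how Beautiful Soup, lxml or
html5lib work internally; the class ListItemCollector and the function
list_items are hypothetical names made up for this example. It treats
every new li start tag as an implied end tag for the li that is still
open, so the items come out as siblings. It deliberately ignores nested
lists, which a real HTML5 parser handles properly.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the text of li elements, honouring the optional end tag.

    Hypothetical helper for illustration only; it does not handle
    lists nested inside an li.
    """

    def __init__(self):
        super().__init__()
        self.items = []        # text of each completed li element
        self._current = None   # text chunks of the open li, or None

    def _close_item(self):
        # Finish the open li, if there is one.
        if self._current is not None:
            self.items.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "li":
            self._close_item()   # a new li implies </li> for the old one
            self._current = []

    def handle_endtag(self, tag):
        # An explicit </li>, or the end of the list, closes the open li.
        if tag in ("li", "ol", "ul"):
            self._close_item()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def list_items(html):
    """Return the text of each li in html as a flat list of strings."""
    collector = ListItemCollector()
    collector.feed(html)
    collector.close()
    return collector.items


sample = """
<OL>
<LI>first line
<LI>second line
<LI>third line
</OL>
"""
print(list_items(sample))
```

The three items come back as three separate strings, none of them
containing the text of the following ones, which is what "li elements
don't nest" means in practice.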

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | h...@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
