Roland Mueller <roland.em0...@googlemail.com> writes:

> But isn't bs4 only for SOAP content?
> Can bs4 or lxml cope with HTML code that does not comply with XML as the
> following fragment?
>
> <p>A
> <p>B
> <hr>

bs4 can do it, but lxml's XML parser (etree.fromstring) wants well-formed XML.

Jupyter console 6.4.0

Python 3.9.9 (main, Nov 16 2021, 07:21:43)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.29.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from bs4 import BeautifulSoup as bs

In [2]: soup = bs('<p>A<p>B<hr>')

In [3]: soup.p
Out[3]: <p>A</p>

In [4]: soup.find_all('p')
Out[4]: [<p>A</p>, <p>B</p>]

In [5]: from lxml import etree

In [6]: root = etree.fromstring('<p>A<p>B<hr>')
Traceback (most recent call last):

  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3444, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "/var/folders/2l/pdng2d2x18d00m41l6r2ccjr0000gn/T/ipykernel_96220/3376613260.py", line 1, in <module>
    root = etree.fromstring('<p>A<p>B<hr>')

  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring

  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument

  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc

  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc

  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult

  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError

  File "<string>", line 1
XMLSyntaxError: Premature end of data in tag hr line 1, line 1, column 13

-- 
Pieter van Oostrum <pie...@vanoostrum.org>
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]
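
A follow-up sketch, not part of the session above: the XMLSyntaxError comes from lxml's default XML parser. lxml also ships an HTML parser, etree.HTMLParser, which should accept the same fragment. The helper names paragraph_texts and is_well_formed_xml are made up for this illustration, and the snippet assumes lxml is installed.

```python
from lxml import etree

FRAGMENT = "<p>A<p>B<hr>"


def is_well_formed_xml(markup):
    """Return True if lxml's default (XML) parser accepts the markup."""
    try:
        etree.fromstring(markup)
    except etree.XMLSyntaxError:
        return False
    return True


def paragraph_texts(markup):
    """Parse markup with lxml's lenient HTML parser and return each <p> text."""
    parser = etree.HTMLParser()  # HTML parser instead of the strict XML default
    root = etree.fromstring(markup, parser)  # root is the synthesized <html> element
    return [p.text for p in root.iter("p")]


if __name__ == "__main__":
    print(is_well_formed_xml(FRAGMENT))
    print(paragraph_texts(FRAGMENT))
```

So the choice is less "bs4 versus lxml" than "HTML parser versus XML parser". bs4 itself can be told which underlying parser to use, for example BeautifulSoup(markup, 'html.parser') or BeautifulSoup(markup, 'lxml').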