On 12/05/2022 22:19, Adrian Bool wrote:
On 12 May 2022, at 13:08, Gilles <[email protected]> wrote:
1. Is there a way to tell lxml _not_ to add <html><body> and </body></html> when
inserting the header right after <body>?
You're not loading the header data with the HTMLParser, are you? It is the
HTMLParser that adds <head> etc. Just parse the content without specifying
a parser...
I tried both, and neither works: Either lxml adds <html><body>, or it
doesn't like the data because it's not XML (and not clean HTML either):
=============
import lxml.etree as et

# Attempt 1: HTML parser. Parses, but wraps the block in <html><body>…</body></html>
parser = et.HTMLParser(encoding='latin1', remove_blank_text=True, recover=True)
content_tree = et.parse("block1.html", parser)

# Attempt 2: default (XML) parser. Fails with
# lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 16
content_tree = et.parse("block1.html")
content_root = content_tree.getroot()
=============
But then, I'm not surprised lxml isn't happy if I feed it half-baked HTML.
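For the archive, a third option that neither attempt above covers, offered only as a sketch: lxml.html.fragments_fromstring() parses a rootless piece of HTML and returns its top-level elements without the <html><body> wrapper. The markup below is a made-up stand-in for block1.html (including an unquoted attribute value like the one the XML parser rejected), not the real file.

```python
import lxml.html

# Hypothetical stand-in for the contents of block1.html:
# two sibling elements, no single root, one unquoted attribute value.
block = '<h1 id=top>Site title</h1><p>Tagline</p>'

# Returns a list of the top-level elements; no <html>/<body> is kept around them.
elements = lxml.html.fragments_fromstring(block)

tags = [el.tag for el in elements]
markup = "".join(lxml.html.tostring(el, encoding="unicode") for el in elements)

print(tags)
print(markup)
```

If the block starts with bare text, that text comes back as a leading string in the list, so it may need handling before the elements are inserted into another tree.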
If you're loading and parsing the header and footer as separate
entities, then no, I wouldn't expect that to be possible. Just keep
header and footer in their own parent Elements and if necessary have
some overall <div> defined within "wrapper"...
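A minimal sketch of the "own parent Element" idea, assuming lxml.html is acceptable here: fragment_fromstring() with create_parent="div" wraps a rootless fragment in a single <div>, which can then be inserted right after <body>. The header markup and the page are invented placeholders, not the actual files from this thread.

```python
import lxml.html

# Hypothetical header fragment: several siblings, no single root.
header_html = '<p class=intro>Header</p><hr>'

# create_parent="div" gives the fragment one parent element to live in.
header = lxml.html.fragment_fromstring(header_html, create_parent="div")

# Hypothetical target page; insert the wrapped header as the first child of <body>.
page = lxml.html.document_fromstring("<html><body><p>Main text</p></body></html>")
page.body.insert(0, header)

result = lxml.html.tostring(page, encoding="unicode")
print(result)
```

The footer could be handled the same way with page.body.append(footer).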
I'll see if I can tweak the HTML so that the header and footer are
self-contained.
Thank you.