On 12/05/2022 22:19, Adrian Bool wrote:
On 12 May 2022, at 13:08, Gilles <[email protected]> wrote:
1. Is there a way to tell lxml _not_ to add <html><body> and </body></html> when 
inserting the header right after <body>?
You're not loading the header data with the HTMLParser, are you?  It is the 
HTMLParser that adds <html>, <body>, etc.  Just parse the content without a 
specific parser specified...

I tried both, and neither works: either lxml adds <html><body>, or it rejects the data because it's not XML (and not clean HTML either):

=============
from lxml import etree as et

# Attempt 1: HTMLParser accepts the block, but wraps it in <html><body>…</body></html>
parser = et.HTMLParser(encoding='latin1', remove_blank_text=True, recover=True)
content_tree = et.parse("block1.html", parser)

# Attempt 2: the default XML parser rejects the same file:
# lxml.etree.XMLSyntaxError: AttValue: " or ' expected, line 2, column 16
content_tree = et.parse("block1.html")

content_root = content_tree.getroot()
=============

But then, I'm not surprised lxml isn't happy if I feed it half-baked HTML.

If you're loading and parsing the header and footer as separate entities, then no, I wouldn't expect that to be possible. Just keep the header and footer in their own parent Elements and, if necessary, have some overall <div> defined within "wrapper"...
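
For reference, a minimal sketch of that "own parent Element" idea, assuming the lxml.html helpers are acceptable here instead of et.parse(). The header_html and page_html strings are made-up stand-ins for the real files, not the actual data from this thread. fragment_fromstring() with create_parent='div' puts the loose header markup under a single <div> rather than under <html><body>:

```python
import lxml.html

# Hypothetical stand-ins for the header file and the target page.
header_html = '<p class=top>Header</p><hr>'
page_html = '<html><body><h1>Title</h1></body></html>'

# Parse the loose fragment; create_parent='div' gives it one <div> parent
# instead of the <html><body> wrapper a full-document parse would add.
header = lxml.html.fragment_fromstring(header_html, create_parent='div')

# Parse the page as a full document and locate its <body>.
page = lxml.html.document_fromstring(page_html)
body = page.find('body')

# Insert the header <div> as the first child, i.e. right after <body>.
body.insert(0, header)

result = lxml.html.tostring(page, encoding='unicode')
print(result)
```

The HTML parser tolerates the unquoted attribute that the XML parser complained about, and the page keeps a single <html>/<body> pair because only the page itself is parsed as a document.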

I'll see if I can tweak the HTML so that the header and footer are self-contained.

Thank you.
