OK, just for reference, attached is my MWE. Get the ZIP file from
gutenberg.org with

wget https://www.gutenberg.org/files/68047/68047-h.zip

I am using lxml 4.8 and Python 3.9, on Ubuntu 20.04 or macOS Big Sur.
Those are really annoying....
Best, /PA
On Fri, 13 May 2022 at 12:47, Gilles <[email protected]> wrote:
> On 12/05/2022 22:32, Adrian Bool wrote:
>
> On 12 May 2022, at 10:26, Gilles <[email protected]> wrote:
>
> File "src\lxml\parser.pxi", line 652, in lxml.etree._raiseParseError
> OSError: Error reading file '<html>
>
>
> Look at the last line above - you're giving parse() a string containing
> XML data which the parse() function is treating as a filename; trying to
> open a file with a name equivalent to your XML content!
>
> If you want to parse an XML string - use et.fromstring() instead.
>
> The StringIO call may be reasonable if your XML didn't exist on disk; but
> if your source data is on disk best to either give parse() the filename
> (but then you get your #13 issue) or pass it a file handle provided by
> open().
>
> Sorry I overlooked the last line. I dumbly supposed that parse() could
> take either a file handle or a string.
> _______________________________________________
> lxml - The Python XML Toolkit mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/lxml.python.org/
> Member address: [email protected]
>
--
Questions are not there to be answered,
questions are there to be asked
Georg Kreisler
Headaches with a Juju log:
unit-basic-16: 09:17:36 WARNING juju.worker.uniter.operation we should run
a leader-deposed hook here, but we can't yet
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import etree, html
from zipfile import ZipFile


def main():
    # HTML parser settings used for this MWE.
    e_parser = etree.HTMLParser(remove_blank_text=True,
                                strip_cdata=False,
                                encoding='utf-8'
                                )
    # Read the HTML member straight out of the ZIP archive.
    with ZipFile('68047-h.zip', 'r') as zfile:
        with zfile.open('68047-h/68047-h.htm', 'r') as hfile:
            tree = etree.parse(hfile, parser=e_parser)
    print(etree.tostring(tree, encoding='utf-8', pretty_print=True).decode())


if __name__ == "__main__":
    main()
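
For comparison, here is a minimal sketch of the three input styles discussed in the quoted advice: fromstring() for markup held in a string, parse() on a StringIO wrapper, and parse() on an open file handle. The sample markup and the temporary file name are made up for illustration and are not taken from the Gutenberg file. Passing the markup string directly to parse() is the case that produced the "Error reading file" message, since parse() treats a plain string as a filename.

```python
import io
import os
import tempfile

from lxml import etree

# Hypothetical sample markup, standing in for the real document.
MARKUP = "<html><body><p>Hello</p></body></html>"

parser = etree.HTMLParser()

# 1. Markup already in memory as a string: fromstring() returns the root element.
root_from_string = etree.fromstring(MARKUP, parser)

# 2. The same string wrapped in a file-like object: parse() returns an ElementTree.
tree_from_stringio = etree.parse(io.StringIO(MARKUP), parser)

# 3. Markup stored on disk: hand parse() an open binary file handle.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.htm")  # illustrative file name
    with open(path, "wb") as out:
        out.write(MARKUP.encode("utf-8"))
    with open(path, "rb") as handle:
        tree_from_handle = etree.parse(handle, parser)

# All three routes should yield the same document structure.
roots = (root_from_string,
         tree_from_stringio.getroot(),
         tree_from_handle.getroot())
for root in roots:
    print(root.tag, root.findtext("body/p"))
```

Note that fromstring() gives back an element while parse() gives back an ElementTree, so getroot() is needed before comparing the results.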