[lxml] [newbie] Way to get tree from root?

2021-09-02 Thread codecomplete
Hello, I'm still learning about lxml, and was wondering if there's a way to get the tree from the root to avoid writing the file to disk before re-reading it just for that: INPUTFILE = "input.kml" #get rid of NS with open(INPUTFILE) as reader: content = reader.read() conte
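One possible answer to the question above, as a minimal sketch: lxml elements expose `getroottree()`, which returns the `_ElementTree` that owns an element, so no round trip through the disk should be needed. The sample markup and the variable `content` below are stand-ins for the KML file in the post, not its actual contents.

```python
from lxml import etree

# Hypothetical stand-in for the content read from input.kml.
content = b"<kml><Document><name>demo</name></Document></kml>"

root = etree.fromstring(content)   # parsing a string yields the root element
tree = root.getroottree()          # the ElementTree that owns that element

# Tree-level helpers such as getpath() are now available.
name_path = tree.getpath(root[0][0])
print(name_path)
```

`tree.write(...)` can then be called directly on the result if the document has to be saved.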

[lxml] [newbie] Different ways to find elements

2021-09-03 Thread codecomplete
Hello, While still learning about lxml and xpath, I'm not clear as to why there are different ways to find elements in a tree: = name = root.xpath('//name') print("xpath/name is ",name[0].text) name=root.findall('.//name') print("findall/name is ",name[0].text) for name in root.ite
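A short comparison of the three lookups mentioned in the post, on a made-up document: `xpath()` evaluates full XPath 1.0, `findall()` uses the smaller ElementPath subset shared with the standard library's ElementTree, and `iter()` walks the tree lazily in document order. For a plain tag-name search like this one they should return the same elements.

```python
from lxml import etree

# Hypothetical sample document, not taken from the thread.
root = etree.fromstring(
    "<doc><item><name>first</name></item><name>second</name></doc>"
)

via_xpath = root.xpath("//name")        # full XPath 1.0, returns a list
via_findall = root.findall(".//name")   # ElementPath subset, returns a list
via_iter = list(root.iter("name"))      # generator, consumed here into a list

for label, found in (("xpath", via_xpath),
                     ("findall", via_findall),
                     ("iter", via_iter)):
    print(label, [elem.text for elem in found])
```

The difference shows up with more complex queries: only `xpath()` supports functions such as `count()` or string results, while `findall()` and `iter()` keep the code portable to `xml.etree.ElementTree`.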

[lxml] Re: [newbie] Way to get tree from root?

2021-09-03 Thread codecomplete
Thank you. I strip the namespace because 1) I'm not clear about what namespaces are for, 2) they make it harder to search, and 3) I'm dealing with very simple XML files so it doesn't look like it makes a difference if I strip the ns from the source file.

[lxml] Re: [newbie] Way to get tree from root?

2021-09-04 Thread codecomplete
Here's some code I found to strip namespaces after parsing, without relying on a regex: ``` # Remove namespace prefixes #Source: https://stackoverflow.com/questions/60486563/ tree = et.parse(INPUTFILE) root = tree.getroot() for elem in root.getiterator(): #ValueError: Invalid input tag of
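A sketch of how that loop might avoid the `ValueError`, assuming the error comes from comments or processing instructions in the tree: their `tag` is not a string, so they can be skipped before the tag is rewritten. The sample KML below is invented for illustration, and `iter()` is used in place of the deprecated `getiterator()`.

```python
from lxml import etree

# Hypothetical namespaced input containing a comment.
xml = (b'<kml xmlns="http://www.opengis.net/kml/2.2">'
       b'<!-- note --><Document><name>demo</name></Document></kml>')
root = etree.fromstring(xml)

for elem in root.iter():
    # Comments and processing instructions have a non-string tag.
    if not isinstance(elem.tag, str):
        continue
    # Keep only the local part of "{namespace}localname".
    elem.tag = etree.QName(elem).localname

# Drop namespace declarations that are no longer used.
etree.cleanup_namespaces(root)

print(etree.tostring(root, encoding="unicode"))
```

After this, un-prefixed searches such as `root.find(".//name")` work without a namespace map.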

[lxml] Re: [newbie] Different ways to find elements

2021-09-04 Thread codecomplete
Thanks very much! ___ lxml - The Python XML Toolkit mailing list -- lxml@python.org To unsubscribe send an email to lxml-le...@python.org https://mail.python.org/mailman3/lists/lxml.python.org/ Member address: arch...@mail-archive.com

[lxml] [newbie] lxml adds &#13; before each end of line

2022-05-10 Thread codecomplete
Hello, This is a newbie question. While editing HTML files on Windows, i.e. with lines ending in 0D0A, lxml adds &#13; before each end of line: #tried different things, to no avail parser = et.HTMLParser(remove_blank_text=True,strip_cdata=False) parser = et.HTMLParser(remove_blank_tex

[lxml] Re: [newbie] lxml adds &#13; before each end of line

2022-05-10 Thread codecomplete
As a work-around, I can always remove the offending substring, but it's a kludge. output = str(et.tostring(root, pretty_print=True)).replace('&#13;', '') print(output)
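An alternative to patching the output, sketched under the assumption that the offending substring is the `&#13;` character reference the serializer emits for a carriage return kept in the text: normalize CRLF to LF before parsing, so there is nothing left to escape. The sample snippet is made up; `encoding="unicode"` is used so that `tostring()` returns a `str` directly rather than bytes wrapped in `str()`.

```python
from lxml import etree

# Hypothetical HTML fragment with Windows (CRLF) line endings.
raw = "<p>line one\r\nline two</p>"

# Normalize line endings before the parser ever sees them.
normalized = raw.replace("\r\n", "\n")

parser = etree.HTMLParser()
root = etree.fromstring(normalized, parser)

# encoding="unicode" yields a str instead of bytes.
output = etree.tostring(root, pretty_print=True, encoding="unicode")
print(output)
```

If the file must keep CRLF endings on disk, the text can be written back with `open(path, "w", newline="\r\n")`.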

[lxml] Re: [HTML] How to get text of attribute?

2022-05-27 Thread codecomplete
For others' benefit: === description = tree.xpath("//meta[@name='description']/@content") print(description[0]) ===

[lxml] getparent() fails with "AttributeError: 'list' object has no attribute 'getparent' "

2022-05-29 Thread codecomplete
Hello, I'd like to find and replace an element in an HTML file. I can't figure out why getparent() doesn't work as expected: == import lxml.html from lxml.html import builder as E import lxml.etree as et parser = et.HTMLParser(remove_blank_text=True,recover=T

[lxml] Re: getparent() fails with "AttributeError: 'list' object has no attribute 'getparent' "

2022-05-29 Thread codecomplete
Here's template.tmpl: === … ===

[lxml] Re: getparent() fails with "AttributeError: 'list' object has no attribute 'getparent' "

2022-05-29 Thread codecomplete
Through trial and error, it looks like xpath() returns a list, even if only one element is found in the tree. This works: HERE = template_tree.xpath('//here') if len(HERE): print("HERE:",HERE) parent = HERE[0].getparent() html_tree = lxml.html.fragment_fromstring("blah",
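The observation above can be sketched end to end. For node queries, `xpath()` returns a list (empty when nothing matches), so the element has to be picked out before calling `getparent()`. The template markup and the `<span>` replacement below are invented for illustration; the actual template.tmpl is not shown in the thread.

```python
import lxml.html

# Hypothetical template with a <here> placeholder element.
template_tree = lxml.html.fromstring(
    "<div><p>before</p><here></here><p>after</p></div>"
)

matches = template_tree.xpath("//here")   # a list, possibly empty
if matches:
    placeholder = matches[0]
    replacement = lxml.html.fragment_fromstring("<span>blah</span>")
    # Swap the placeholder for the new element inside its parent.
    placeholder.getparent().replace(placeholder, replacement)

print(lxml.html.tostring(template_tree, encoding="unicode"))
```

When only the first match matters, `template_tree.find(".//here")` returns a single element or `None` and avoids the indexing step.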

[lxml] Good way to remove/catch wrong tags?

2023-01-28 Thread codecomplete
Hello, Some columns in a DB have badly formed HTML, to the point BeautifulSoup (lxml?) fails: = #Some records start with 0A soup = BeautifulSoup("\n", 'lxml') #AttributeError: 'NoneType' object has no attribute 'text' print(soup.body.text) = What would be a nice way to s

[lxml] Re: Good way to remove/catch wrong tags?

2023-01-28 Thread codecomplete
As a work-around, if there's only a handful of wrong records, catching the error and fixing the records in the DB does the job: === try: #file.write(soup.body.text) text = soup.body.text except AttributeError as error: file.write(str(error))
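Instead of catching the `AttributeError`, the missing body can be tested for directly. This is a minimal sketch assuming the failure is always `soup.body` being `None`, as in the whitespace-only record from the earlier post; `body_text` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

def body_text(markup):
    """Return the text of <body>, or "" when the parser built no body."""
    soup = BeautifulSoup(markup, "lxml")
    if soup.body is None:          # e.g. a record containing only "\n"
        return ""
    return soup.body.get_text()

print(repr(body_text("\n")))
print(repr(body_text("<p>hello</p>")))
```

Returning an empty string keeps the export loop running; logging the record's key at that point would still make the bad rows easy to find later.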

[lxml] [newbie] Preserving carriage returns when calling soup.body.text?

2023-02-10 Thread codecomplete
Hello, I can't find how to tell lxml/BS to preserve carriage returns in an HTML snippet when calling soup.body.text: After removing 's, it also removes the CRLF that follows. == builder = LXMLTreeBuilderForXML(preserve_whitespace_tags=["body"]) rows = cur.fetchall() for row in rows:

[lxml] Re: [newbie] Preserving carriage returns when calling soup.body.text?

2023-02-10 Thread codecomplete
My mistake, I'm sorry. All the carriage returns were stripped in the input file. BS/lxml weren't to blame. Problem solved.

[lxml] Time-out? Wrong code?

2023-07-09 Thread codecomplete
Hello, I'm no lxml expert, so it could be a newbie error, but the following web crawler script sometimes breaks (see "BUG") while trying to find the number of provinces/properties, even after two one-second sleeps: == import requests from lxml import html import re import math import ti

[lxml] Re: Time-out? Wrong code?

2023-07-09 Thread codecomplete
Thanks for the idea. I'll add some code to check the input.
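A sketch of what such an input check might look like, kept free of network calls so it can be tried on its own. `first_text` is a hypothetical helper, and the sample markup and XPath expression are invented; the real page structure is not shown in the thread. It expects an expression that selects elements, not attributes or strings.

```python
from lxml import html

def first_text(page_source, xpath_expr):
    """Return the stripped text of the first matching element, or None."""
    # An empty or whitespace-only response cannot be parsed usefully.
    if not page_source or not page_source.strip():
        return None
    tree = html.fromstring(page_source)
    matches = tree.xpath(xpath_expr)
    # The page loaded, but the expected element is missing.
    if not matches:
        return None
    return matches[0].text_content().strip()

sample = "<div><span class='count'>12 properties</span></div>"
print(first_text(sample, "//span[@class='count']"))
print(first_text("", "//span[@class='count']"))
```

With `requests`, checking `response.status_code` (or calling `response.raise_for_status()`) before passing `response.text` to such a helper separates a failed download from a page that simply lacks the element.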