Re: xml.etree and namespaces -- why?
Jon Ribbens wrote:
> That's because you *always* need to know the URI of the namespace,
> because that's its only meaningful identifier. If you assume that a
> particular namespace always uses the same prefix then your code will be
> completely broken. The following two pieces of XML should be understood
> identically:
>
> http://www.inkscape.org/namespaces/inkscape;>
>
> and:
>
> http://www.inkscape.org/namespaces/inkscape;>
>
> So you can see why e.get('inkscape:label') cannot possibly work, and why
> e.get('{http://www.inkscape.org/namespaces/inkscape}label') makes sense.

I get it. It does.

> The xml.etree author obviously knew that this was cumbersome, and
> hence you can do something like:
>
>     namespaces = {'inkspace': 'http://www.inkscape.org/namespaces/inkscape'}
>     element = root.find('inkspace:foo', namespaces)
>
> which will work for both of the above pieces of XML.

Makes sense. It forces me to make up my own prefixes which I can then safely use in my code rather than relying on the xml's generator to not change their prefixes.

BTW, I only now thought to look at what actually is at Inkscape's namespace URI, and it turns out to be quite a nice explanation of what a namespace is and why it looks like a URL.

--
https://mail.python.org/mailman/listinfo/python-list
Re: Yaml.unsafe_load error
On Oct 19, 2022 13:02, Albert-Jan Roskam wrote:
> Hi,
> I am trying to create a celery.schedules.crontab object from an external
> yaml file. I can successfully create an instance from a dummy class "Bar",
> but the crontab class seems to call __setstate__ prior to __init__. I have
> no idea how to solve this. Any ideas? See code below.
> Thanks!
> Albert-Jan
>
> # what is the correct way for the next line?
> >>> yaml.unsafe_load('!!python/object:celery.schedules.crontab\n hour: 3\n minute: 30')

Reading the source a bit more, methinks it might be:

    yaml.unsafe_load('!!python/object/apply:celery.schedules.crontab\nkwds:\n hour: 3\n minute: 30')

I did not yet test this, though.
Re: xml.etree and namespaces -- why?
I have no idea why, I used to remove namespaces, following the advice from stackoverflow:

https://stackoverflow.com/questions/4255277/lxml-etree-xmlparser-remove-unwanted-namespace

    _ns_removal_xslt_transform = etree.XSLT(etree.fromstring('''
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform;>
    '''))
    xml_doc = _ns_removal_xslt_transform(
        etree.fromstring(my_xml_data)
    )

Later on, when I worked with SVG, I used BeautifulSoup.

Axy.
Re: xml.etree and namespaces -- why?
On 2022-10-19, Robert Latest wrote:
> If the XML input has namespaces, tags and attributes with prefixes
> in the form prefix:sometag get expanded to {uri}sometag where the
> prefix is replaced by the full URI.
>
> Which means that given an Element e, I cannot directly access its attributes
> using e.get() because in order to do that I need to know the URI of the
> namespace.

That's because you *always* need to know the URI of the namespace,
because that's its only meaningful identifier. If you assume that a
particular namespace always uses the same prefix then your code will be
completely broken. The following two pieces of XML should be understood
identically:

http://www.inkscape.org/namespaces/inkscape;>

and:

http://www.inkscape.org/namespaces/inkscape;>

So you can see why e.get('inkscape:label') cannot possibly work, and why
e.get('{http://www.inkscape.org/namespaces/inkscape}label') makes sense.

The xml.etree author obviously knew that this was cumbersome, and
hence you can do something like:

    namespaces = {'inkspace': 'http://www.inkscape.org/namespaces/inkscape'}
    element = root.find('inkspace:foo', namespaces)

which will work for both of the above pieces of XML. But unfortunately
as far as I can see nobody's thought about doing the same for attributes
rather than tags.
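To make the point above concrete, here is a minimal sketch. The two documents, the `foo` element, the `ink` prefix, and the `qname` helper are all made up for illustration (the XML in the archived message did not survive); only `ET.fromstring`, `Element.find` with a namespaces mapping, and `Element.get` are standard-library calls. The helper is one possible way to get prefix-style access for attributes, not something xml.etree provides.

```python
import xml.etree.ElementTree as ET

INKSCAPE_URI = 'http://www.inkscape.org/namespaces/inkscape'

# Two hypothetical documents: same namespace URI, different prefixes.
DOC_A = '<svg xmlns:inkscape="%s"><inkscape:foo inkscape:label="hello"/></svg>' % INKSCAPE_URI
DOC_B = '<svg xmlns:ink="%s"><ink:foo ink:label="hello"/></svg>' % INKSCAPE_URI

# Prefixes chosen by the calling code, independent of either document.
NAMESPACES = {'inkspace': INKSCAPE_URI}


def qname(name, namespaces):
    """Expand 'prefix:local' to '{uri}local' (hypothetical helper)."""
    prefix, sep, local = name.partition(':')
    if not sep:
        return name  # no prefix: leave the name unchanged
    return '{%s}%s' % (namespaces[prefix], local)


def get_label(xml_text):
    root = ET.fromstring(xml_text)
    # find() resolves our own prefix through the mapping...
    element = root.find('inkspace:foo', NAMESPACES)
    # ...while get() needs the expanded '{uri}label' form.
    return element.get(qname('inkspace:label', NAMESPACES))


labels = [get_label(DOC_A), get_label(DOC_B)]
print(labels)
```

Both documents yield the same label, because the lookup only ever depends on the URI and never on the prefix the document happened to use.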
Re: xml.etree and namespaces -- why?
I mean, it's worth looking at the BeautifulSoup source to see how they do that. With BS I work with attributes exactly as you want, and I explicitly tell BS to use the lxml parser.

Axy.

On 19/10/2022 14:25, Robert Latest via Python-list wrote:
> [...]
xml.etree and namespaces -- why?
Hi all,

For the impatient: Below the longish text is a fully self-contained Python example that illustrates my problem.

I'm struggling to understand xml.etree's handling of namespaces. I'm trying to parse an Inkscape document which uses several namespaces. From etree's documentation:

    If the XML input has namespaces, tags and attributes with prefixes
    in the form prefix:sometag get expanded to {uri}sometag where the
    prefix is replaced by the full URI.

Which means that given an Element e, I cannot directly access its attributes using e.get() because in order to do that I need to know the URI of the namespace. So rather than doing this (see example below):

    label = e.get('inkscape:label')

I need to do this:

    label = e.get('{' + uri_inkscape_namespace + '}label')

...which is the method mentioned in etree's docs:

    One way to search and explore this XML example is to manually add
    the URI to every tag or attribute in the xpath of a find() or
    findall(). [...] A better way to search the namespaced XML example
    is to create a dictionary with your own prefixes and use those in
    the search functions.

Good idea! Better yet, that dictionary or rather, its reverse, already exists, because etree has used it to unnecessarily mangle the namespaces in the first place. The documentation doesn't mention where it can be found, but we can just use the 'xmlns:' attributes of the root element to rebuild it. Or so I thought, until I found out that etree deletes exactly these attributes before handing the element to the user.

I'm really stumped here. Apart from the fact that I think XML is bloated shit anyway and has no place outside HTML, I just don't get the purpose of etree's way of working:

1) Evaluate 'xmlns:' attributes of the element
2) Use that info to replace the existing prefixes by {uri}
3) Realizing that using {uri} prefixes is cumbersome, suggest to the user to build their own prefix -> uri dictionary to undo the effort of doing 1) and 2)
4) ...but withholding exactly the information that existed in the original document by deleting the 'xmlns:' attributes from the tag

Why didn't they leave the whole damn thing alone? Keep intact and keep the attribute 'prefix:key' literally as they are. For anyone wanting to use the {uri} prefixes (why would they) they could have thrown in a helper function for the prefix->URI translation.

I'm assuming that etree's designers knew what they were doing in order to make my life easier when dealing with XML. Maybe I'm missing the forest for the trees. Can anybody enlighten me?

Thanks!

self-contained example

    import xml.etree.ElementTree as ET

    def test_svg(xml):
        root = ET.fromstring(xml)
        for e in root.iter():
            print(e.tag)  # tags are shown prefixed with {URI}
            if e.tag.endswith('svg'):
                # Since namespaces are defined inside the tag, let's use the info
                # from the 'xmlns:' attributes to undo etree's URI prefixing
                print('Element :')
                for k, v in e.items():
                    print(' %s: %s' % (k, v))
                # ...but alas: the 'xmlns:' attributes have been deleted by the parser

    xml = '''<svg
       xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
       xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
       xmlns="http://www.w3.org/2000/svg"
       xmlns:svg="http://www.w3.org/2000/svg">
    </svg>
    '''

    if __name__ == '__main__':
        test_svg(xml)
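On the question of where the document's own prefix -> URI mapping can be found: one option that stays within the standard library is `ET.iterparse` with the `start-ns` event, which reports each namespace declaration as a `(prefix, uri)` pair while the tree is built. The sketch below is only an illustration; the `parse_with_namespaces` helper, the sample SVG and its `Layer 1` label are assumptions, not taken from the thread.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this sketch.
SVG = b'''<svg
   xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
   xmlns="http://www.w3.org/2000/svg">
  <g inkscape:label="Layer 1"/>
</svg>'''


def parse_with_namespaces(source):
    """Return (root, {prefix: uri}) for a file-like XML source (hypothetical helper)."""
    namespaces = {}
    root = None
    for event, payload in ET.iterparse(source, events=('start', 'start-ns')):
        if event == 'start-ns':
            prefix, uri = payload  # the default namespace has prefix ''
            namespaces[prefix] = uri
        elif root is None:
            root = payload  # the first 'start' event belongs to the root element
    return root, namespaces


root, namespaces = parse_with_namespaces(io.BytesIO(SVG))

# Use the recovered mapping to build the expanded attribute name.
group = root[0]
label = group.get('{%s}label' % namespaces['inkscape'])
print(namespaces)
print(label)
```

Note that this relies on the prefixes the document happens to use, which is exactly what the replies in this thread caution against; a mapping defined in your own code is more robust when the generator changes its prefixes.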
Yaml.unsafe_load error
Hi,

I am trying to create a celery.schedules.crontab object from an external yaml file. I can successfully create an instance from a dummy class "Bar", but the crontab class seems to call __setstate__ prior to __init__. I have no idea how to solve this. Any ideas? See code below.

Thanks!

Albert-Jan

Python 3.6.8 (default, Nov 16 2020, 16:55:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import yaml
>>> from celery.schedules import crontab
>>> crontab(hour=3, minute=0)
>>> yaml.unsafe_load('!!python/name:celery.schedules.crontab')
>>> yaml.safe_load('celery.schedules.crontab:\n hour: 3\n minute: 0\n')
{'celery.schedules.crontab': {'hour': 3, 'minute': 0}}
>>> class Bar:
...     def __init__(self, x, y):
...         pass
...
>>> bar = yaml.unsafe_load('!!python/object:__main__.Bar\n x: 42\n y: 666')
>>> bar
<__main__.Bar object at 0x7f43b464bb38>
>>> bar.x
42

# what is the correct way for the next line?
>>> yaml.unsafe_load('!!python/object:celery.schedules.crontab\n hour: 3\n minute: 30')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/albertjan@mycompany/envs/myenv/lib64/python3.6/site-packages/yaml/__init__.py", line 182, in unsafe_load
    return load(stream, UnsafeLoader)
  File "/home/albertjan@mycompany/envs/myenv/lib64/python3.6/site-packages/yaml/__init__.py", line 114, in load
    return loader.get_single_data()
  File "/home/albertjan@mycompany/envs/myenv/lib64/python3.6/site-packages/yaml/constructor.py", line 51, in get_single_data
    return self.construct_document(node)
  File "/home/albertjan@mycompany/envs/myenv/lib64/python3.6/site-packages/yaml/constructor.py", line 60, in construct_document
    for dummy in generator:
  File "/home/albertjan@mycompany/envs/myenv/lib64/python3.6/site-packages/yaml/constructor.py", line 621, in construct_python_object
    self.set_python_instance_state(instance, state)
  File "/home/albertjan@mycompany/envs/myenv/lib64/python3.6/site-packages/yaml/constructor.py", line 727, in set_python_instance_state
    instance, state, unsafe=True)
  File "/home/albertjan@mycompany/envs/myenv/lib64/python3.6/site-packages/yaml/constructor.py", line 597, in set_python_instance_state
    instance.__setstate__(state)
  File "/home/albertjan@mycompany/envs/myenv/lib/python3.6/site-packages/celery/schedules.py", line 541, in __setstate__
    super().__init__(**state)
TypeError: __init__() got an unexpected keyword argument 'hour'