On 2014-01-05 14:26, Steven D'Aprano wrote:
On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:

Danny walked you through the XML. Note that he didn't decode the
response. It includes an encoding on the first line:

    <?xml version="1.0" encoding="ISO-8859-1" ?>

That surprises me. I thought XML was only valid in UTF-8? Or maybe that
was wishful thinking.

        tree = ET.fromstring(response.read())

I believe you were correct the first time.
My experience has been that, even though the XML was advertised as ISO-8859-1 (which I believe is synonymous with Latin-1), my script (specifically Python's XML parser, xml.etree.ElementTree) didn't work until I decoded the response from Latin-1 into Unicode and then encoded it as UTF-8. Here's the snippet, with comments noting the painful lessons learned:
"""
    response = urllib2.urlopen(url_format_str % (ip_address,))
    encoding = response.headers.getparam('charset')
    info = response.read().decode(encoding)
    # <info> is now <type 'unicode'>.
    n = info.find('\n')
    xml = info[n+1:]  # Drop the first line (the XML declaration).
    # root = ET.fromstring(xml)  # This raises:
    # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
    # in position 456: ordinal not in range(128)
    root = ET.fromstring(xml.encode("utf-8"))
"""



In other words, leave it to ElementTree to manage the decoding and
encoding itself. Nice -- I like that solution.
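
A minimal sketch of that approach, assuming a made-up payload in place of the real HTTP response (the <city> element and its text are hypothetical, chosen only because they contain the same u'\xe1' character as the error above). The bytes are handed to ElementTree undecoded, so the parser can act on the encoding named in the declaration:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.read(): raw bytes in ISO-8859-1,
# with a declaration that names the same encoding.
raw = (u'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
       u'<city>Bogot\xe1</city>').encode("iso-8859-1")

# Pass the undecoded bytes straight to the parser; it reads the
# declaration and decodes the content itself.
root = ET.fromstring(raw)
print(repr(root.text))
```

This only works if the declaration is left in place and matches the actual bytes. Once the first line is stripped, the parser has nothing to go on and falls back to UTF-8.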