On 2014-01-05 14:26, Steven D'Aprano wrote:
On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
Danny walked you through the XML. Note that he didn't decode the
response. It includes an encoding on the first line:
<?xml version="1.0" encoding="ISO-8859-1" ?>
That surprises me. I thought XML was only valid in UTF-8? Or maybe that
was wishful thinking.
tree = ET.fromstring(response.read())
I believe you were correct the first time.
My experience with all of this has been that, even though the XML was
advertised as encoded in ISO-8859-1 (which I believe is synonymous with
Latin-1), my script (specifically Python's XML parser,
xml.etree.ElementTree) didn't work until the XML was decoded from
Latin-1 (into Unicode) and then encoded into UTF-8. Here's the snippet,
with some comments mentioning the painful lessons learned:
"""
response = urllib2.urlopen(url_format_str %\
(ip_address, ))
encoding = response.headers.getparam('charset')
info = response.read().decode(encoding)
# <info> comes in as <type 'unicode'>.
n = info.find('\n')
xml = info[n+1:] # Get rid of a header line.
# root = ET.fromstring(xml) # This causes error:
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
# in position 456: ordinal not in range(128)
root = ET.fromstring(xml.encode("utf-8"))
"""
In other words, leave it to ElementTree to manage the decoding and
encoding itself. Nice -- I like that solution.
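
For comparison, here is a minimal sketch of the two approaches discussed
in this thread, run side by side. It is written for Python 3 (the snippet
above uses Python 2's urllib2), and the payload is a made-up stand-in for
the server response, not the real data. It assumes, as the snippet above
appears to, that the header line being stripped is the XML declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload standing in for the real response body.
text = '<?xml version="1.0" encoding="ISO-8859-1" ?>\n<city>Bogot\xe1</city>'
raw = text.encode("iso-8859-1")  # the bytes as they would arrive on the wire

# Approach 1: hand the undecoded bytes to ElementTree and let the parser
# use the encoding named in the XML declaration.
root = ET.fromstring(raw)

# Approach 2: decode by hand, drop the first line (assumed here to be the
# XML declaration), then re-encode as UTF-8. With no declaration left, the
# parser falls back to its default of UTF-8.
decoded = raw.decode("iso-8859-1")
body = decoded[decoded.find("\n") + 1:]
root2 = ET.fromstring(body.encode("utf-8"))

print(root.text == root2.text)
```

If the declaration were left in place while the bytes were re-encoded as
UTF-8, the declared and actual encodings would disagree, so the second
approach depends on removing that first line.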