Re: [Tutor] encoding question
On Sun, Jan 5, 2014 at 5:26 PM, Steven D'Aprano wrote:
> On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
>>
>
> That surprises me. I thought XML was only valid in UTF-8? Or maybe that
> was wishful thinking.

JSON text SHALL be encoded in Unicode:

https://tools.ietf.org/html/rfc4627#section-3

For XML, UTF-8 is recommended by RFC 3023, but not required. Also, the MIME charset takes precedence. Section 8 has examples:

https://tools.ietf.org/html/rfc3023#section-8

So I was technically wrong to rely on the XML encoding (they happen to be the same in this case). Instead you can create a parser with the encoding from the header:

    encoding = response.headers.getparam('charset')
    parser = ET.XMLParser(encoding=encoding)
    tree = ET.parse(response, parser)

The expat parser (pyexpat) used by Python is limited to ASCII, Latin-1 and Unicode transport encodings. So it's probably better to transcode to UTF-8 as Alex is doing, but then use a custom parser to override the XML encoding:

    encoding = response.headers.getparam('charset')
    info = response.read().decode(encoding).encode('utf-8')
    parser = ET.XMLParser(encoding='utf-8')
    tree = ET.fromstring(info, parser)

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
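For readers on Python 3, the transcode-then-override idea from this message can be sketched roughly as follows. This is only an illustration under assumptions: the helper name parse_with_charset and the one-element payload are made up for the example, and the charset is passed in by hand rather than read from a real HTTP response.

```python
import xml.etree.ElementTree as ET


def parse_with_charset(raw, charset):
    """Parse XML bytes whose transport encoding is `charset`.

    `charset` stands in for the value from the Content-Type header.
    """
    # Transcode to UTF-8 so expat only sees an encoding it supports natively.
    utf8_bytes = raw.decode(charset).encode('utf-8')
    # A parser-level encoding overrides the one named in the XML declaration.
    parser = ET.XMLParser(encoding='utf-8')
    return ET.fromstring(utf8_bytes, parser=parser)


# Hypothetical payload: declared as, and transmitted in, ISO-8859-1.
raw = b'<?xml version="1.0" encoding="ISO-8859-1"?><city>Bogot\xe1</city>'
root = parse_with_charset(raw, 'iso-8859-1')
print(root.tag, root.text)
```

The declaration still says ISO-8859-1 after transcoding, which is why the explicit parser encoding matters here.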
Re: [Tutor] encoding question
On 2014-01-05 14:26, Steven D'Aprano wrote: On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote: Danny walked you through the XML. Note that he didn't decode the response. It includes an encoding on the first line: That surprises me. I thought XML was only valid in UTF-8? Or maybe that was wishful thinking. tree = ET.fromstring(response.read()) I believe you were correct the first time. My experience with all of this has been that in spite of the xml having been advertised as having been encoded in ISO-8859-1 (which I believe is synonymous with Latin-1), my script (specifically Python's xml parser: xml.etree.ElementTree) didn't work until the xml was decoded from Latin-1 (into Unicode) and then encoded into UTF-8. Here's the snippet with some comments mentioning the painful lessons learned: """ response = urllib2.urlopen(url_format_str %\ (ip_address, )) encoding = response.headers.getparam('charset') info = response.read().decode(encoding) # comes in as . n = info.find('\n') xml = info[n+1:] # Get rid of a header line. # root = ET.fromstring(xml) # This causes error: # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' # in position 456: ordinal not in range(128) root = ET.fromstring(xml.encode("utf-8")) """ In other words, leave it to ElementTree to manage the decoding and encoding itself. Nice -- I like that solution. ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote: > Danny walked you through the XML. Note that he didn't decode the > response. It includes an encoding on the first line: > > That surprises me. I thought XML was only valid in UTF-8? Or maybe that was wishful thinking. > tree = ET.fromstring(response.read()) In other words, leave it to ElementTree to manage the decoding and encoding itself. Nice -- I like that solution. -- Steven ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
On 2014-01-05 08:02, eryksun wrote: On Sun, Jan 5, 2014 at 2:57 AM, Alex Kleider wrote: def ip_info(ip_address): response = urllib2.urlopen(url_format_str %\ (ip_address, )) encoding = response.headers.getparam('charset') print "'encoding' is '%s'." % (encoding, ) info = unicode(response.read().decode(encoding)) decode() returns a unicode object. n = info.find('\n') print "location of first newline is %s." % (n, ) xml = info[n+1:] print "'xml' is '%s'." % (xml, ) tree = ET.fromstring(xml) root = tree.getroot() # Here's where it blows up!!! print "'root' is '%s', with the following children:" % (root, ) for child in root: print child.tag, child.attrib print "END of CHILDREN" return info Danny walked you through the XML. Note that he didn't decode the response. It includes an encoding on the first line: Leave it to ElementTree. Here's something to get you started: import urllib2 import xml.etree.ElementTree as ET import collections url_format_str = 'http://api.hostip.info/?ip=%s&position=true' GML = 'http://www.opengis.net/gml' IPInfo = collections.namedtuple('IPInfo', ''' ip city country latitude longitude ''') def ip_info(ip_address): response = urllib2.urlopen(url_format_str % ip_address) tree = ET.fromstring(response.read()) hostip = tree.find('{%s}featureMember/Hostip' % GML) ip = hostip.find('ip').text city = hostip.find('{%s}name' % GML).text country = hostip.find('countryName').text coord = hostip.find('.//{%s}coordinates' % GML).text lon, lat = coord.split(',') return IPInfo(ip, city, country, lat, lon) >>> info = ip_info('201.234.178.62') >>> info.ip '201.234.178.62' >>> info.city, info.country (u'Bogot\xe1', 'COLOMBIA') >>> info.latitude, info.longitude ('10.4', '-75.2833') This assumes everything works perfect. You have to decide how to fail gracefully for the service being unavailable or malformed XML (incomplete or corrupted response, etc). Thanks again for the input. 
You're using some ET syntax there that would probably make my code much more readable but will require a bit more study on my part. I was up all night trying to get this sorted out and was finally successful. (Re-) Reading 'joelonsoftware' and some of the Python docs helped. Here's what I came up with (still needs modification to return a dictionary, but that'll be trivial.) alex@x301:~/Python/Parse$ cat ip_xml.py #!/usr/bin/env python # vim: set fileencoding=utf-8 : # -*- coding : utf-8 -*- # file: 'ip_xml.py' import urllib2 import xml.etree.ElementTree as ET url_format_str = \ u'http://api.hostip.info/?ip=%s&position=true' def ip_info(ip_address): response = urllib2.urlopen(url_format_str %\ (ip_address, )) encoding = response.headers.getparam('charset') info = response.read().decode(encoding) # comes in as . n = info.find('\n') xml = info[n+1:] # Get rid of a header line. # root = ET.fromstring(xml) # This causes error: # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' # in position 456: ordinal not in range(128) root = ET.fromstring(xml.encode("utf-8")) # This is the part I still don't fully understand but would # probably have to look at the library source to do so. info = [] for i in range(4): info.append(root[3][0][i].text) info.append(root[3][0][4][0][0][0].text) return info if __name__ == "__main__": info = ip_info("201.234.178.62") print info print info[1] alex@x301:~/Python/Parse$ ./ip_xml.py ['201.234.178.62', u'Bogot\xe1', 'COLOMBIA', 'CO', '-75.2833,10.4'] Bogotá Thanks to all who helped. ak ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
On Sun, Jan 5, 2014 at 2:57 AM, Alex Kleider wrote: > def ip_info(ip_address): > > response = urllib2.urlopen(url_format_str %\ >(ip_address, )) > encoding = response.headers.getparam('charset') > print "'encoding' is '%s'." % (encoding, ) > info = unicode(response.read().decode(encoding)) decode() returns a unicode object. > n = info.find('\n') > print "location of first newline is %s." % (n, ) > xml = info[n+1:] > print "'xml' is '%s'." % (xml, ) > > tree = ET.fromstring(xml) > root = tree.getroot() # Here's where it blows up!!! > print "'root' is '%s', with the following children:" % (root, ) > for child in root: > print child.tag, child.attrib > print "END of CHILDREN" > return info Danny walked you through the XML. Note that he didn't decode the response. It includes an encoding on the first line: Leave it to ElementTree. Here's something to get you started: import urllib2 import xml.etree.ElementTree as ET import collections url_format_str = 'http://api.hostip.info/?ip=%s&position=true' GML = 'http://www.opengis.net/gml' IPInfo = collections.namedtuple('IPInfo', ''' ip city country latitude longitude ''') def ip_info(ip_address): response = urllib2.urlopen(url_format_str % ip_address) tree = ET.fromstring(response.read()) hostip = tree.find('{%s}featureMember/Hostip' % GML) ip = hostip.find('ip').text city = hostip.find('{%s}name' % GML).text country = hostip.find('countryName').text coord = hostip.find('.//{%s}coordinates' % GML).text lon, lat = coord.split(',') return IPInfo(ip, city, country, lat, lon) >>> info = ip_info('201.234.178.62') >>> info.ip '201.234.178.62' >>> info.city, info.country (u'Bogot\xe1', 'COLOMBIA') >>> info.latitude, info.longitude ('10.4', '-75.2833') This assumes everything works perfect. You have to decide how to fail gracefully for the service being unavailable or malformed XML (incomplete or corrupted response, etc). 
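One possible shape for the "fail gracefully" part, sketched in Python 3. Everything here is an assumption layered on the code above: the SAMPLE document is a hand-made stand-in built only from the element paths used in ip_info(), the function name parse_ip_info is hypothetical, and returning None is just one policy among several (raising a custom exception would be equally reasonable). Network errors from the URL fetch would need separate handling and are left out.

```python
import collections
import xml.etree.ElementTree as ET

GML = 'http://www.opengis.net/gml'
IPInfo = collections.namedtuple('IPInfo', 'ip city country latitude longitude')

# Hand-made stand-in for a response; not the service's real output.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<HostipLookupResultSet xmlns:gml="http://www.opengis.net/gml">'
    b'<gml:featureMember><Hostip>'
    b'<ip>201.234.178.62</ip>'
    b'<gml:name>Bogot\xe1</gml:name>'
    b'<countryName>COLOMBIA</countryName>'
    b'<ipLocation><gml:pointProperty><gml:Point>'
    b'<gml:coordinates>-75.2833,10.4</gml:coordinates>'
    b'</gml:Point></gml:pointProperty></ipLocation>'
    b'</Hostip></gml:featureMember>'
    b'</HostipLookupResultSet>'
)


def parse_ip_info(data):
    """Return an IPInfo for the XML bytes in `data`, or None if unusable."""
    try:
        tree = ET.fromstring(data)
    except ET.ParseError:
        return None  # malformed, truncated or corrupted XML
    hostip = tree.find('{%s}featureMember/Hostip' % GML)
    if hostip is None:
        return None  # well-formed, but not the structure we expect
    fields = [
        hostip.find('ip'),
        hostip.find('{%s}name' % GML),
        hostip.find('countryName'),
        hostip.find('.//{%s}coordinates' % GML),
    ]
    if any(f is None or f.text is None for f in fields):
        return None  # an expected element is missing or empty
    ip, city, country, coord = (f.text for f in fields)
    parts = coord.split(',')
    if len(parts) != 2:
        return None  # coordinates not in "lon,lat" form
    lon, lat = parts
    return IPInfo(ip, city, country, lat, lon)


print(parse_ip_info(SAMPLE))
print(parse_ip_info(b'<broken'))
```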
Re: [Tutor] encoding question
On Sat, Jan 04, 2014 at 11:57:20PM -0800, Alex Kleider wrote: > Well, I've tried the xml approach which seems promising but still I get > an encoding related error. > Is there a bug in the xml.etree module (not very likely, me thinks) or > am I doing something wrong? I'm no expert on XML, but it looks to me like it is a bug in ElementTree. It doesn't appear to handle unicode strings correctly (although perhaps it doesn't promise to). A simple demonstration using Python 2.7: py> import xml.etree.ElementTree as ET py> ET.fromstring(u'a') But: py> ET.fromstring(u'á') Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML parser.feed(text) File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1622, in feed self._parser.Parse(data, 0) UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 5: ordinal not in range(128) An easy work-around: py> ET.fromstring(u'á'.encode('utf-8')) although, as I said, I'm no expert on XML and this may lead to errors later on. > There's no denying that the whole encoding issue is still not completely > clear to me in spite of having devoted a lot of time to trying to grasp > all that's involved. Have you read Joel On Software's explanation? http://www.joelonsoftware.com/articles/Unicode.html It's well worth reading. Start with that, and then ask if you have any further questions. > Here's what I've got: > > alex@x301:~/Python/Parse$ cat ip_xml.py > #!/usr/bin/env python > # -*- coding : utf -8 -*- > # file: 'ip_xml.py' [...] > tree = ET.fromstring(xml) > root = tree.getroot() # Here's where it blows up!!! I reckon that what you need is to change the first line to: tree = ET.fromstring(xml.encode('latin-1')) or whatever the encoding is meant to be. -- Steven ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
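A caveat on the encode() work-arounds, shown as a small Python 3 sketch. The one-element documents are hypothetical. The point it tries to illustrate: once the XML declaration has been cut off (as the script in this thread does with its first line), the parser has nothing telling it the bytes are Latin-1 and falls back to UTF-8, so encoding to 'latin-1' would be expected to fail where encoding to 'utf-8' works.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents, as they might arrive from a Latin-1 service.
with_decl = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Bogot\xe1</name>'
without_decl = b'<name>Bogot\xe1</name>'

# With the declaration intact, the parser decodes the Latin-1 bytes itself.
name = ET.fromstring(with_decl).text

# Without it, the parser assumes UTF-8 and rejects the lone 0xE1 byte.
try:
    ET.fromstring(without_decl)
    rejected = False
except ET.ParseError:
    rejected = True

# Re-encoding the text as UTF-8 makes the declaration-free form parse again.
fixed = ET.fromstring(without_decl.decode('latin-1').encode('utf-8')).text

print(name, rejected, fixed)
```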
Re: [Tutor] encoding question
On 05/01/2014 02:31, Alex Kleider wrote: I've been maintaining both a Python3 and a Python2.7 version. The latter has actually opened my eyes to more complexities. Specifically the need to use unicode strings rather than Python2.7's default ascii. This might help http://python-future.org/ -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
On 01/05/2014 08:57 AM, Alex Kleider wrote:
> On 2014-01-04 21:20, Danny Yoo wrote:
>> Oh! That's unfortunate! That looks like a bug on the hostip.info side.
>> Check with them about it.
>>
>> I can't get the source code to whatever is implementing the JSON
>> response, so I can not say why the city is not being properly included
>> there.
>>
>> [... XML rant about to start. I am not disinterested, so my apologies
>> in advance.]
>>
>> ... In that case... I suppose trying the XML output is a possible
>> approach.
>
> Well, I've tried the xml approach which seems promising but still I get
> an encoding related error.

Note that the (computing) data description format (JSON, XML...) and the textual format, or "encoding" (Unicode utf8/16/32, legacy iso-8859-* also called latin-*, ...) are more or less unrelated and independent. Changing the data description format cannot solve a text encoding issue (but it may hide it, if by chance the new data description format happens to use the text encoding you happen to use when reading, implicitly or explicitly).

Denis
Re: [Tutor] encoding question
On 01/05/2014 03:31 AM, Alex Kleider wrote:
> I've been maintaining both a Python3 and a Python2.7 version. The latter
> has actually opened my eyes to more complexities. Specifically the need to
> use unicode strings rather than Python2.7's default ascii.

So-called Unicode strings are not the solution to all problems. Take your 'á': it can be represented either by 1 "precomposed" code (unicode code point) 0xe1, or by 2 ucodes (one for the "base" 'a', one for the "combining" '´'). Imagine you search for "Bogotá": how do you know which representation is used in the text you search? How do you know at all that there are multiple representations, and what they are?

The routine will work iff, by chance, your *programming editor* (!) used the same representation as the software used to create the searched text... Usually that is the case, because most text-creation software uses precomposed codes, when they exist, for composite characters. (But this fact just makes the issue more rare, harder to be aware of, and thus difficult to cope with correctly in code. As far as I know nearly no software does it.)

Denis
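The two representations described in this message can be made visible, and reconciled, with the standard unicodedata module. A minimal Python 3 sketch; the helper name contains is made up for the example, and NFC is just one of the available normalization forms.

```python
import unicodedata

precomposed = 'Bogot\u00e1'   # 'á' as the single code point U+00E1
decomposed = 'Bogota\u0301'   # 'a' followed by U+0301 COMBINING ACUTE ACCENT

# Both print as "Bogotá", yet they are different strings.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))


def contains(haystack, needle):
    """Substring test that ignores precomposed/decomposed differences."""
    normalized_haystack = unicodedata.normalize('NFC', haystack)
    normalized_needle = unicodedata.normalize('NFC', needle)
    return normalized_needle in normalized_haystack


print(contains('City: ' + decomposed, precomposed))
```

Normalizing both sides to the same form before comparing is what makes the search independent of which editor or program produced the text.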
Re: [Tutor] encoding question
On 01/04/2014 08:26 PM, Alex Kleider wrote: Any suggestions as to a better way to handle the problem of encoding in the following context would be appreciated. The problem arose because 'Bogota' is spelt with an acute accent on the 'a'. $ cat IP_info.py3 #!/usr/bin/env python3 # -*- coding : utf -8 -*- # file: 'IP_info.py3' a module. import urllib.request url_format_str = \ 'http://api.hostip.info/get_html.php?ip=%s&position=true' def ip_info(ip_address): """ Returns a dictionary keyed by Country, City, Lat, Long and IP. Depends on http://api.hostip.info (which returns the following: 'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude: 38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.) THIS COULD BREAK IF THE WEB SITE GOES AWAY!!! """ response = urllib.request.urlopen(url_format_str %\ (ip_address, )).read() sp = response.splitlines() country = city = lat = lon = ip = '' for item in sp: if item.startswith(b"Country:"): try: country = item[9:].decode('utf-8') except: print("Exception raised.") country = item[9:] elif item.startswith(b"City:"): try: city = item[6:].decode('utf-8') except: print("Exception raised.") city = item[6:] elif item.startswith(b"Latitude:"): try: lat = item[10:].decode('utf-8') except: print("Exception raised.") lat = item[10] elif item.startswith(b"Longitude:"): try: lon = item[11:].decode('utf-8') except: print("Exception raised.") lon = item[11] elif item.startswith(b"IP:"): try: ip = item[4:].decode('utf-8') except: print("Exception raised.") ip = item[4:] return {"Country" : country, "City" : city, "Lat" : lat, "Long" : lon, "IP" : ip} if __name__ == "__main__": addr = "201.234.178.62" print ("""IP address is %(IP)s: Country: %(Country)s; City: %(City)s. Lat/Long: %(Lat)s/%(Long)s""" % ip_info(addr)) """ The output I get on an Ubuntu 12.4LTS system is as follows: alex@x301:~/Python/Parse$ ./IP_info.py3 Exception raised. IP address is 201.234.178.62: Country: COLOMBIA (CO); City: b'Bogot\xe1'. 
Lat/Long: 10.4/-75.2833

I would have thought that utf-8 could handle the 'a-acute'.

Thanks,
alex

'á' does not encode to 0xe1 in the utf8 encoding; what you are reading is probably (legacy) text in latin-1 (or another latin-* encoding).

Denis
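This is easy to check at the interpreter. A short Python 3 sketch, using the byte string from the output quoted above; decoding as latin-1 is the guess made in this message, not something the bytes themselves prove.

```python
raw = b'Bogot\xe1'  # the bytes shown in the output above

# The same character has different byte forms in the two encodings.
as_utf8 = '\u00e1'.encode('utf-8')
as_latin1 = '\u00e1'.encode('latin-1')
print(as_utf8, as_latin1)

# So the received bytes are not valid UTF-8 ...
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# ... but they decode cleanly as Latin-1.
city = raw.decode('latin-1')
print(valid_utf8, city)
```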
Re: [Tutor] encoding question
On 01/05/2014 12:52 AM, Steven D'Aprano wrote:
> If you don't understand an exception, you have no business covering it up
> and hiding that it took place. Never use a bare try...except, always catch
> the *smallest* number of specific exception types that make sense. Better
> is to avoid catching exceptions at all: an exception (usually) means
> something has gone wrong. You should aim to fix the problem *before* it
> blows up, not after.
>
> I'm reminded of a quote:
>
> "I find it amusing when novice programmers believe their main job is
> preventing programs from crashing. ... More experienced programmers realize
> that correct code is great, code that crashes could use improvement, but
> incorrect code that doesn't crash is a horrible nightmare." -- Chris Smith
>
> Your code is incorrect, it does the wrong thing, but it doesn't crash, it
> just covers up the fact that an exception occurred.

An exception, or any other kind of anomaly detected by a func one calls, is in most cases a *symptom* of an error, somewhere else in one's code (possibly far in source, possibly long earlier, possibly apparently unrelated). Catching an exception (except in rare cases) is just suppressing a _signal_ about a probable error. Catching an exception does not make the code correct, it just pretends to (except in rare cases). It's like hiding the dirt under a carpet, or beating up the poor guy that ran for 3 kilometers to tell you a fire is threatening your home.

Again: the anomaly (eg wrong input) detected by a func is not the error; it is a consequence of the true original error, which is what one should aim at correcting. (But our culture apparently loves repressing symptoms rather than curing actual problems: we programmers just often thoughtlessly apply the scheme ;-)

We should instead gratefully thank func authors for having correctly done their jobs of controlling input. They offer us the information needed to find bugs which otherwise may happily go on their lives undetected; and thus the opportunity to write more correct software. (This is why func authors should control input, refuse any anomalous or dubious values, and never ever try to guess what the app expects in such cases; instead just say "cannot do my job safely, or at all".)

If one is passing an empty set to an 'average' func, don't blame the func or shut up the signal/exception, instead be grateful to the func's author, and find why and how it happens that the set is empty. If one is trying to write into a file, don't blame the file for not existing, the user for being stupid, or shut up the signal/exception, instead be grateful to the func's author, and find why and how it happens that the file does not exist, now (about the user: is your doc clear enough?).

The sub-category of cases where exception handling makes sense at all is the following:
* a called function may fail (eg average, find a given item in a list, write into a file)
* and, the failure case makes sense for the app, it _does_ belong to the app logic
* and, the case should nevertheless be handled like others up to this point in code (meaning, there should not be a separate branch for it, we should really land there in code even for this failure case)
* and, one cannot know whether it is a failure case without trying, or it would be as costly as just trying (wrong for average, right for the 2 other examples)
* and, one can repair the failure right here, in any case, and go on correctly according to the app logic (depends on apps)
(there is also the category of alternate running modes)

In such a situation, the right thing to do is to catch the exception signal (or use whatever error management exists, eg a check for a None return value) and proceed correctly (and think of testing this case ;-).

But this is not that common. In particular, if the failure case does not belong to the app logic (the item should be there, the file should exist) then do *not* catch a potential signal: if it happens, it would tell you about a bug *elsewhere* in code; and _this_ is what is to be corrected.

There is a mythology in programming that software should not crash; wrongly understood (or rightly, authors of such texts usually are pretty unclear and ambiguous), this leads to catching exceptions that are just signals of symptoms of errors... Instead, software should crash whenever it is incorrect; often (when the error does not cause obvious misbehaviour) it is the only way for the programmer to know about errors. Crashes are the programmer's best friend (I mean, those programmers whose aim is to write quality software).

Denis
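To make the distinction above concrete, here is a small Python 3 sketch of a case where catching does fit the criteria listed: a missing cache file on first run is part of the app logic and can be repaired on the spot. The load_cache function and the idea of a lookup cache are invented for this illustration. Note what is deliberately *not* caught: a file that exists but holds broken JSON is a symptom of a bug elsewhere, so that error is allowed to propagate.

```python
import json


def load_cache(path):
    """Return the cached lookups stored at `path`, or {} on first run."""
    try:
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    except FileNotFoundError:
        # Expected on first run: the failure belongs to the app logic and
        # an empty cache is a correct way to carry on.
        return {}
    # No bare "except:" here. Corrupt contents raise json.JSONDecodeError,
    # which is left uncaught so that the underlying problem stays visible.
```

Only the one exception type that has a meaningful repair is named; everything else still crashes loudly.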
Re: [Tutor] encoding question
On 2014-01-04 21:20, Danny Yoo wrote: Oh! That's unfortunate! That looks like a bug on the hostip.info side. Check with them about it. I can't get the source code to whatever is implementing the JSON response, so I can not say why the city is not being properly included there. [... XML rant about to start. I am not disinterested, so my apologies in advance.] ... In that case... I suppose trying the XML output is a possible approach. Well, I've tried the xml approach which seems promising but still I get an encoding related error. Is there a bug in the xml.etree module (not very likely, me thinks) or am I doing something wrong? There's no denying that the whole encoding issue is still not completely clear to me in spite of having devoted a lot of time to trying to grasp all that's involved. Here's what I've got: alex@x301:~/Python/Parse$ cat ip_xml.py #!/usr/bin/env python # -*- coding : utf -8 -*- # file: 'ip_xml.py' import urllib2 import xml.etree.ElementTree as ET url_format_str = \ u'http://api.hostip.info/?ip=%s&position=true' def ip_info(ip_address): response = urllib2.urlopen(url_format_str %\ (ip_address, )) encoding = response.headers.getparam('charset') print "'encoding' is '%s'." % (encoding, ) info = unicode(response.read().decode(encoding)) n = info.find('\n') print "location of first newline is %s." % (n, ) xml = info[n+1:] print "'xml' is '%s'." % (xml, ) tree = ET.fromstring(xml) root = tree.getroot() # Here's where it blows up!!! print "'root' is '%s', with the following children:" % (root, ) for child in root: print child.tag, child.attrib print "END of CHILDREN" return info if __name__ == "__main__": info = ip_info("201.234.178.62") alex@x301:~/Python/Parse$ ./ip_xml.py 'encoding' is 'iso-8859-1'. location of first newline is 44. 
'xml' is 'xmlns:gml="http://www.opengis.net/gml"; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; xsi:noNamespaceSchemaLocation="http://www.hostip.info/api/hostip-1.0.1.xsd";> This is the Hostip Lookup Service hostip inapplicable 201.234.178.62 Bogotá COLOMBIA CO http://www.opengis.net/gml/srs/epsg.xml#4326";> -75.2833,10.4 '. Traceback (most recent call last): File "./ip_xml.py", line 33, in info = ip_info("201.234.178.62") File "./ip_xml.py", line 23, in ip_info tree = ET.fromstring(xml) File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1301, in XML parser.feed(text) File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1641, in feed self._parser.Parse(data, 0) UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 456: ordinal not in range(128) ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
On 2014-01-04 21:20, Danny Yoo wrote: Oh! That's unfortunate! That looks like a bug on the hostip.info side. Check with them about it. I can't get the source code to whatever is implementing the JSON response, so I can not say why the city is not being properly included there. [... XML rant about to start. I am not disinterested, so my apologies in advance.] ... In that case... I suppose trying the XML output is a possible approach. But I truly dislike XML for being implemented in ways that are usually not fun to navigate: either the APIs or the encoded data are usually convoluted enough to make it a chore rather than a pleasure. The beginning does look similar: ## import xml.etree.ElementTree as ET import urllib response = urllib.urlopen("http://api.hostip.info?ip=201.234.178.62&position=true";) tree = ET.parse(response) tree ## Up to this point, not so bad. But this is where it starts to look silly: ## tree.find('{http://www.opengis.net/gml}featureMember/Hostip/ip').text '201.234.178.62' tree.find('{http://www.opengis.net/gml}featureMember/Hostip/{http://www.opengis.net/gml}name').text u'Bogot\xe1' ## where we need to deal with XML namespaces, an extra complexity for a benefit that I have never bought into. More than that, usually the XML I run into in practice isn't even properly structured, as is the case with the lat-long value in the XML output here: ## tree.find('.//{http://www.opengis.net/gml}coordinates').text '-75.2833,10.4' ## which is truly silly. Why is the latitude and longitude not two separate, structured values? What is this XML buying us here, really then? I'm convinced that all the extraneous structure and complexity in XML causes the people who work with it to stop caring, the result being something that isn't for the benefit of either humans nor computer programs. Hence, that's why I prefer JSON: JSON export is usually a lot more sensible, for reasons that I can speculate on, but I probably should stop this rant. :P Not a rant at all. 
As it turns out, one of the other things that have interested me of late is docbook, an xml dialect (I think this is the correct way to express it.) I've found it very useful and so do not share your distaste for xml although one can't disagree with the points you've made with regard to xml as a solution to the problem under discussion. I've not played with the python xml interfaces before so this will be a good project for me. Thanks. ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
> then? I'm convinced that all the extraneous structure and complexity > in XML causes the people who work with it to stop caring, the result > being something that isn't for the benefit of either humans nor > computer programs. ... I'm sorry. Sometimes I get grumpy when I haven't had a Snickers. I should not have said the above here. It isn't factual, and worse, it insinuates an uncharitable intent to people who I do not know. There's enough insinuation and insults out there in the world already: I should not be contributing to those things. For that, I apologize. ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
Oh! That's unfortunate! That looks like a bug on the hostip.info side. Check with them about it.

I can't get the source code to whatever is implementing the JSON response, so I can not say why the city is not being properly included there.

[... XML rant about to start. I am not disinterested, so my apologies in advance.]

... In that case... I suppose trying the XML output is a possible approach. But I truly dislike XML for being implemented in ways that are usually not fun to navigate: either the APIs or the encoded data are usually convoluted enough to make it a chore rather than a pleasure.

The beginning does look similar:

##
>>> import xml.etree.ElementTree as ET
>>> import urllib
>>> response = urllib.urlopen("http://api.hostip.info?ip=201.234.178.62&position=true")
>>> tree = ET.parse(response)
>>> tree
##

Up to this point, not so bad. But this is where it starts to look silly:

##
>>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/ip').text
'201.234.178.62'
>>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/{http://www.opengis.net/gml}name').text
u'Bogot\xe1'
##

where we need to deal with XML namespaces, an extra complexity for a benefit that I have never bought into.

More than that, usually the XML I run into in practice isn't even properly structured, as is the case with the lat-long value in the XML output here:

##
>>> tree.find('.//{http://www.opengis.net/gml}coordinates').text
'-75.2833,10.4'
##

which is truly silly. Why is the latitude and longitude not two separate, structured values? What is this XML buying us here, really then? I'm convinced that all the extraneous structure and complexity in XML causes the people who work with it to stop caring, the result being something that isn't for the benefit of either humans nor computer programs.

Hence, that's why I prefer JSON: JSON export is usually a lot more sensible, for reasons that I can speculate on, but I probably should stop this rant. :P
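The namespace noise in the paths above can be reduced somewhat with the namespaces argument that ElementTree's find() accepts, which lets a short prefix stand in for the full URI. A Python 3 sketch; the SAMPLE bytes are a hand-made stand-in shaped after the element paths used in this message, not the service's actual output, and the prefix map NS is our own choice.

```python
import xml.etree.ElementTree as ET

# Prefix map for our own queries; the prefix name is arbitrary.
NS = {'gml': 'http://www.opengis.net/gml'}

# Hand-made stand-in for the response discussed above.
SAMPLE = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b'<HostipLookupResultSet xmlns:gml="http://www.opengis.net/gml">'
    b'<gml:featureMember><Hostip>'
    b'<ip>201.234.178.62</ip>'
    b'<gml:name>Bogot\xe1</gml:name>'
    b'<ipLocation><gml:pointProperty><gml:Point>'
    b'<gml:coordinates>-75.2833,10.4</gml:coordinates>'
    b'</gml:Point></gml:pointProperty></ipLocation>'
    b'</Hostip></gml:featureMember>'
    b'</HostipLookupResultSet>'
)

doc = ET.fromstring(SAMPLE)
ip = doc.find('gml:featureMember/Hostip/ip', NS).text
name = doc.find('gml:featureMember/Hostip/gml:name', NS).text

# The coordinates arrive as one "lon,lat" string, so split them by hand.
lon, lat = (float(part) for part in doc.find('.//gml:coordinates', NS).text.split(','))

print(ip, name, lat, lon)
```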
Re: [Tutor] encoding question
On Sat, Jan 4, 2014 at 11:16 PM, Alex Kleider wrote: > {u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4', u'country_code': > u'CO', u'country_name': u'COLOMBIA', u'lng': u'-75.2833'} > > If I use my own IP the city comes in fine so there must still be some > problem with the encoding. Report a bug in their JSON API. It's returning b'"city":null'. I see the same problem for www.msj.go.cr in San José, Costa Rica. It's probably broken for all non-ASCII byte strings. ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
On 2014-01-04 18:44, Danny Yoo wrote: Hi Alex, According to: http://www.hostip.info/use.html there is a JSON-based interface. I'd recommend using that one! JSON is a format that's easy for machines to decode. The format you're parsing is primarily for humans, and who knows if that will change in the future to make it easier to read? Not only is JSON probably more reliable to parse, but the code itself should be fairly straightforward. For example: # ## In Python 2.7 ## import json import urllib response = urllib.urlopen('http://api.hostip.info/get_json.php') info = json.load(response) info {u'country_name': u'UNITED STATES', u'city': u'Mountain View, CA', u'country_code': u'US', u'ip': u'216.239.45.81'} # This strikes me as being the most elegant solution to date, and I thank you for it! The problem is that the city name doesn't come in: alex@x301:~/Python/Parse$ cat tutor.py #!/usr/bin/env python # -*- coding : utf -8 -*- # file: 'tutor.py' """ Put your docstring here. """ print "Running 'tutor.py'..." import json import urllib response = urllib.urlopen\ ('http://api.hostip.info/get_json.php?ip=201.234.178.62&position=true') info = json.load(response) print info alex@x301:~/Python/Parse$ ./tutor.py Running 'tutor.py'... {u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4', u'country_code': u'CO', u'country_name': u'COLOMBIA', u'lng': u'-75.2833'} If I use my own IP the city comes in fine so there must still be some problem with the encoding. should I be using encoding = response.headers.getparam('charset') in there somewhere? Any ideas? ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
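On the question of where the charset would go: in general it is applied to the raw bytes before the JSON parser sees them. A Python 3 sketch under assumptions: the body below is a made-up response in ISO-8859-1, and load_json is a hypothetical helper. This would not help if the service itself sends null for the city, since there is then no text left to decode.

```python
import json


def load_json(body, charset='utf-8'):
    """Decode response bytes with the advertised charset, then parse them."""
    return json.loads(body.decode(charset))


# Hypothetical response body, transmitted as ISO-8859-1.
body = '{"city": "Bogot\u00e1", "country_code": "CO"}'.encode('iso-8859-1')
info = load_json(body, 'iso-8859-1')
print(info['city'])
```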
Re: [Tutor] encoding question
On Sat, 04 Jan 2014 18:31:13 -0800, Alex Kleider wrote:
> exactly what the line
> # -*- coding : utf -8 -*-
> really indicates or more importantly, is it true, since I am using vim
> and I assume things are encoded as ascii?

I don't know vim specifically, but I'm 99% sure it will let you specify the encoding. Certainly emacs does, so I'd not expect vim to fall behind on such a fundamental point. It's also likely that it defaults to UTF-8 for new files. In any case, your job is to make sure that the encoding line matches what the editor is using. Emacs also looks in the first few lines for that same encoding line, so if you format it carefully, it'll just work.

It's easy to test for yourself, anyway: just paste some international characters into a literal string.

--
DaveA
Re: [Tutor] encoding question
You were asking earlier about the line:

    # -*- coding : utf -8 -*-

See PEP 263:

    http://www.python.org/dev/peps/pep-0263/
    http://docs.python.org/release/2.3/whatsnew/section-encodings.html

It's a line that tells Python how to interpret the bytes of your source program. It allows us to write unicode literal strings embedded directly in the program source itself.
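A quick way to see whether a given comment counts as an encoding declaration is to test it against the pattern PEP 263 gives for the first two lines of a file. This is only a sketch: declared_encoding is a hypothetical helper, and the regular expression is the one quoted in the PEP as I read it. Note that the spaced-out form from the script ("coding : utf -8") does not match, because the colon has to follow "coding" directly.

```python
import re

# Pattern quoted in PEP 263 for recognising an encoding declaration.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(line):
    """Return the encoding name declared on this line, or None if there is none."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None

print(declared_encoding("# -*- coding: utf-8 -*-"))    # recognised
print(declared_encoding("# -*- coding : utf -8 -*-"))  # not recognised
```

So the line as written in the script is just an ordinary comment, and Python falls back to its default source encoding.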
Re: [Tutor] encoding question
Hi Alex,

According to:

    http://www.hostip.info/use.html

there is a JSON-based interface. I'd recommend using that one! JSON is a format that's easy for machines to decode. The format you're parsing is primarily for humans, and who knows if that will change in the future to make it easier to read?

Not only is JSON probably more reliable to parse, but the code itself should be fairly straightforward. For example:

#
## In Python 2.7
##
>>> import json
>>> import urllib
>>> response = urllib.urlopen('http://api.hostip.info/get_json.php')
>>> info = json.load(response)
>>> info
{u'country_name': u'UNITED STATES', u'city': u'Mountain View, CA',
u'country_code': u'US', u'ip': u'216.239.45.81'}
#

Best of wishes!
Re: [Tutor] encoding question
A heartfelt thank you to those of you that have given me much to ponder with your helpful responses.

In the meantime I've rewritten my procedure using a different approach altogether. I'd be interested in knowing if you think it's worth keeping or do you suggest I use your revisions to my original hack?

I've been maintaining both a Python3 and a Python2.7 version. The latter has actually opened my eyes to more complexities. Specifically the need to use unicode strings rather than Python2.7's default ascii.

Here it is:

alex@x301:~/Python/Parse$ cat ip_info.py
#!/usr/bin/env python
# -*- coding : utf -8 -*-

import re
import urllib2

url_format_str = \
    u'http://api.hostip.info/get_html.php?ip=%s&position=true'

info_exp = r"""
Country:[ ](?P<country>.*)
[\n]
City:[ ](?P<city>.*)
[\n]
[\n]
Latitude:[ ](?P<lat>.*)
[\n]
Longitude:[ ](?P<lon>.*)
[\n]
IP:[ ](?P<ip>.*)
"""
info_pattern = re.compile(info_exp, re.VERBOSE).search

def ip_info(ip_address):
    """
    Returns a dictionary keyed by Country, City, Lat, Lon and IP.

    Depends on http://api.hostip.info (which returns the following:
    'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude:
    38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.)
    THIS COULD BREAK IF THE WEB SITE GOES AWAY!!!
    """
    response = urllib2.urlopen(url_format_str %\
                               (ip_address, ))
    encoding = response.headers.getparam('charset')
    info = info_pattern(response.read().decode(encoding))
    return {"Country" : unicode(info.group("country")),
            "City" : unicode(info.group("city")),
            "Lat" : unicode(info.group("lat")),
            "Lon" : unicode(info.group("lon")),
            "IP" : unicode(info.group("ip"))}

if __name__ == "__main__":
    print """IP address is %(IP)s:
    Country: %(Country)s; City: %(City)s.
    Lat/Long: %(Lat)s/%(Lon)s""" % ip_info("201.234.178.62")

Apart from soliciting your general comments, I'm also interested to know exactly what the line

    # -*- coding : utf -8 -*-

really indicates or more importantly, is it true, since I am using vim and I assume things are encoded as ascii?

I've discovered that with Ubuntu it's very easy to switch from English (US) to English (US, international with dead keys) with just two clicks so thanks for that tip as well.
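The verbose pattern can be checked offline against the sample text quoted in the docstring, without touching the network. This is a sketch for Python 3; parse_info is a hypothetical name, and the Bogotá sample is typed in by hand rather than fetched from the service.

```python
import re

INFO_EXP = r"""
Country:[ ](?P<country>.*)
[\n]
City:[ ](?P<city>.*)
[\n]
[\n]
Latitude:[ ](?P<lat>.*)
[\n]
Longitude:[ ](?P<lon>.*)
[\n]
IP:[ ](?P<ip>.*)
"""
INFO_PATTERN = re.compile(INFO_EXP, re.VERBOSE)

def parse_info(text):
    """Return the named groups as a dict, or None if the layout has changed."""
    match = INFO_PATTERN.search(text)
    return match.groupdict() if match else None

sample = ('Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\n'
          'Latitude: 38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n')
print(parse_info(sample))

# Bytes as they might arrive for the Colombian address, decoded as Latin-1 first.
raw = b'Country: COLOMBIA (CO)\nCity: Bogot\xe1\n\nLatitude: 10.4\nLongitude: -75.2833\nIP: 201.234.178.62\n'
print(parse_info(raw.decode('iso-8859-1'))['city'])
```

Returning None when nothing matches makes a change in the site's output format show up as a clear failure rather than an AttributeError on info.group.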
Re: [Tutor] encoding question
On Sat, Jan 04, 2014 at 04:15:30PM -0800, Alex Kleider wrote:
> >py> 'Bogotá'.encode('utf-8')
>
> I'm interested in knowing how you were able to enter the above line
> (assuming you have a keyboard similar to mine.)

I'm running Linux, and I use the KDE or Gnome character selector, depending on which computer I'm using. They give you a graphical window showing a screenful of characters at a time. Depending on the application, you can search for characters by name or property, then copy them into the clipboard to paste them into another application.

I can also use the "compose" key. My keyboard doesn't have an actual key labelled compose, but my system is set to use the right-hand Windows key (between Alt and the menu key) as the compose key. (Why the left-hand Windows key isn't set to do the same thing is a mystery to me.) So if I type:

    <compose> ' a

I get á.

The problem with the compose key is that it's not terribly intuitive. Sure, a few of them are: <compose> 1 2 gives ½ but how do I get π (pi)? <compose> p doesn't work.

-- 
Steven
Re: [Tutor] encoding question
Following my previous email...

On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in
> the following context would be appreciated. The problem arose because
> 'Bogota' is spelt with an acute accent on the 'a'.

Eryksun has given the right answer for how to extract the encoding from the webpage's headers. That will help 9 times out of 10. But unfortunately sometimes webpages will lack an encoding header, or they will lie, or the text will be invalid for that encoding. What to do then?

Let's start by factoring out the repeated code in your giant for-loop into something more manageable and maintainable:

> sp = response.splitlines()
> country = city = lat = lon = ip = ''
> for item in sp:
>     if item.startswith(b"Country:"):
>         try:
>             country = item[9:].decode('utf-8')
>         except:
>             print("Exception raised.")
>             country = item[9:]
>     elif item.startswith(b"City:"):
>         try:
>             city = item[6:].decode('utf-8')
>         except:
>             print("Exception raised.")
>             city = item[6:]

and so on, becomes:

    encoding = ...  # as per Eryksun's email
    sp = response.splitlines()
    country = city = lat = lon = ip = ''
    for item in sp:
        if not item.strip():
            continue  # the response contains a blank line; skip it
        key, value = item.split(b':', 1)  # item is bytes, so split on bytes
        key = key.decode(encoding).strip()
        value = value.decode(encoding).strip()
        if key == 'Country':
            country = value
        elif key == 'City':
            city = value
        elif key == 'Latitude':
            lat = value
        elif key == "Longitude":
            lon = value
        elif key == 'IP':
            ip = value
        else:
            raise ValueError('unknown key "%s" found' % key)
    return {"Country" : country,
            "City" : city,
            "Lat" : lat,
            "Long" : lon,
            "IP" : ip
            }

But we can do better than that!

    encoding = ...  # as per Eryksun's email
    sp = response.splitlines()
    record = {"Country": None, "City": None, "Latitude": None,
              "Longitude": None, "IP": None}
    for item in sp:
        if not item.strip():
            continue  # skip the blank line in the response
        key, value = item.split(b':', 1)
        key = key.decode(encoding).strip()
        value = value.decode(encoding).strip()
        if key in record:
            record[key] = value
        else:
            raise ValueError('unknown key "%s" found' % key)
    if None in list(record.values()):
        for key, value in record.items():
            if value is None: break
        raise ValueError('missing key in record: %s' % key)
    return record

This simplifies the code a lot, and adds some error-handling. It may be appropriate for your application to handle missing keys by using some default value, such as an empty string, or some other value that cannot be mistaken for an actual value, say "*missing*". But since I don't know your application's needs, I'm going to leave that up to you. Better to start strict and loosen up later, than start too loose and never realise that errors are occurring.

I've also changed the keys "Lat" and "Lon" to "Latitude" and "Longitude". If that's a problem, it's easy to fix. Just before returning the record, change the key:

    record['Lat'] = record.pop('Latitude')

and similar for Longitude.

Now that the code is simpler to read and maintain, we can start dealing with the risk that the encoding will be missing or wrong.

A missing encoding is easy to handle: just pick a default encoding, and hope it is the right one. UTF-8 is a good choice. (It's the only *correct* choice, everybody should be using UTF-8, but alas they often don't.) So modify Eryksun's code snippet to return 'UTF-8' if the header is missing, and you should be good.

How to deal with incorrect encodings? That can happen when the website creator *thinks* they are using a certain encoding, but somehow invalid bytes for that encoding creep into the data. That gives us a few different strategies:

(1) The third-party "chardet" module can analyse text and try to guess what encoding it *actually* is, rather than what encoding it claims to be. This is what Firefox and other web browsers do, because there are an awful lot of shitty websites out there. But it's not foolproof, so even if it guesses correctly, you still have to deal with invalid data.

(2) By default, the decode method will raise an exception. You can catch the exception and try again with a different encoding:

    for codec in (encoding, 'utf-8', 'latin-1'):
        try:
            key = key.decode(codec)
        except UnicodeDecodeError:
            pass
        else:
            break

Latin-1 should be last, because it has the nice property that it will *always* succeed. That doesn't mean it will give you the right characters, as intended by the person who wrote the website, just that it will always give
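The retry loop from strategy (2) can be wrapped in a small helper so the same fallback order is used for every field. What follows is a sketch only: decode_with_fallback is a hypothetical name, and reporting back which codec succeeded is an addition of this example, meant to make silent fallbacks visible.

```python
def decode_with_fallback(data, declared=None):
    """Try the declared codec, then UTF-8, then Latin-1.

    Returns (text, codec_used). Latin-1 maps every byte to a character,
    so the loop always ends with a result, though not necessarily the
    characters the author intended.
    """
    candidates = [codec for codec in (declared, 'utf-8', 'latin-1') if codec]
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except (UnicodeDecodeError, LookupError):
            # LookupError covers a header that names a codec Python doesn't know.
            continue
    raise AssertionError('unreachable: latin-1 accepts every byte sequence')

print(decode_with_fallback(b'Bogot\xe1', 'iso-8859-1'))  # header was right
print(decode_with_fallback(b'Bogot\xe1', 'utf-8'))       # header lied, falls back
print(decode_with_fallback(b'Bogot\xc3\xa1'))            # no header at all
```

Logging the second element of the result whenever it differs from the declared codec keeps the fallback from hiding a misconfigured server.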
Re: [Tutor] encoding question
On Sat, Jan 4, 2014 at 7:15 PM, Alex Kleider wrote:
>>
>> py> 'Bogotá'.encode('utf-8')
>
> I'm interested in knowing how you were able to enter the above line
> (assuming you have a keyboard similar to mine.)

I use an international keyboard layout:

    https://en.wikipedia.org/wiki/QWERTY#US-International

One could also copy and paste from a printed literal:

    >>> 'Bogot\xe1'
    'Bogotá'

Or more verbosely:

    >>> 'Bogot\N{latin small letter a with acute}'
    'Bogotá'
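For anyone who cannot type the character at all, the escape forms can be compared side by side. A small sketch, assuming Python 3 string literals; the variable names are just for illustration.

```python
import unicodedata

by_hex = 'Bogot\xe1'                                   # two-digit hex escape
by_name = 'Bogot\N{LATIN SMALL LETTER A WITH ACUTE}'   # Unicode character name
by_code_point = 'Bogot\u00e1'                          # four-digit code point

print(by_hex == by_name == by_code_point)   # all three spell the same string
print(unicodedata.name(by_hex[-1]))         # look the name up from the character
print(by_hex.encode('utf-8'), by_hex.encode('latin-1'))
```

The last line shows why the byte counts differ: the same character is two bytes in UTF-8 and one byte in Latin-1.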
Re: [Tutor] encoding question
On 2014-01-04 15:52, Steven D'Aprano wrote:
> Oh great. An exception was raised. What sort of exception? What error
> message did it have? Why did it happen? Nobody knows, because you throw
> it away.
>
> Never, never, never do this. If you don't understand an exception, you
> have no business covering it up and hiding that it took place. Never use
> a bare try...except; always catch the *smallest* number of specific
> exception types that make sense. Better is to avoid catching exceptions
> at all: an exception (usually) means something has gone wrong. You
> should aim to fix the problem *before* it blows up, not after.
>
> I'm reminded of a quote:
>
> "I find it amusing when novice programmers believe their main job is
> preventing programs from crashing. ... More experienced programmers
> realize that correct code is great, code that crashes could use
> improvement, but incorrect code that doesn't crash is a horrible
> nightmare." -- Chris Smith
>
> Your code is incorrect, it does the wrong thing, but it doesn't crash,
> it just covers up the fact that an exception occurred.
>
>> The output I get on an Ubuntu 12.04 LTS system is as follows:
>> alex@x301:~/Python/Parse$ ./IP_info.py3
>> Exception raised.
>> IP address is 201.234.178.62:
>> Country: COLOMBIA (CO); City: b'Bogot\xe1'.
>> Lat/Long: 10.4/-75.2833
>>
>> I would have thought that utf-8 could handle the 'a-acute'.
>
> Of course it can:
>
> py> 'Bogotá'.encode('utf-8')

I'm interested in knowing how you were able to enter the above line (assuming you have a keyboard similar to mine.)

> b'Bogot\xc3\xa1'
>
> py> b'Bogot\xc3\xa1'.decode('utf-8')
> 'Bogotá'
>
> But you don't have UTF-8. You have something else, and trying to decode
> it using UTF-8 fails.
>
> py> b'Bogot\xe1'.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5:
> unexpected end of data
>
> More to follow...

I very much agree with your remarks. In a pathetic attempt at self defence I just want to mention that what I presented wasn't what I thought was a solution. Rather it was an attempt to figure out what the problem was as a preliminary step to fixing it. With help from you and others, I was successful in doing this. And for that help, I thank all list participants very much.
Re: [Tutor] encoding question
On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in
> the following context would be appreciated.

Python gives you lots of useful information when errors occur, but unfortunately your code throws that information away and replaces it with a totally useless message:

> try:
>     country = item[9:].decode('utf-8')
> except:
>     print("Exception raised.")

Oh great. An exception was raised. What sort of exception? What error message did it have? Why did it happen? Nobody knows, because you throw it away.

Never, never, never do this. If you don't understand an exception, you have no business covering it up and hiding that it took place. Never use a bare try...except; always catch the *smallest* number of specific exception types that make sense. Better is to avoid catching exceptions at all: an exception (usually) means something has gone wrong. You should aim to fix the problem *before* it blows up, not after.

I'm reminded of a quote:

"I find it amusing when novice programmers believe their main job is preventing programs from crashing. ... More experienced programmers realize that correct code is great, code that crashes could use improvement, but incorrect code that doesn't crash is a horrible nightmare." -- Chris Smith

Your code is incorrect, it does the wrong thing, but it doesn't crash, it just covers up the fact that an exception occurred.

> The output I get on an Ubuntu 12.04 LTS system is as follows:
> alex@x301:~/Python/Parse$ ./IP_info.py3
> Exception raised.
> IP address is 201.234.178.62:
> Country: COLOMBIA (CO); City: b'Bogot\xe1'.
> Lat/Long: 10.4/-75.2833
>
> I would have thought that utf-8 could handle the 'a-acute'.

Of course it can:

    py> 'Bogotá'.encode('utf-8')
    b'Bogot\xc3\xa1'

    py> b'Bogot\xc3\xa1'.decode('utf-8')
    'Bogotá'

But you don't have UTF-8. You have something else, and trying to decode it using UTF-8 fails.

    py> b'Bogot\xe1'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: unexpected end of data

More to follow...

-- 
Steven
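To make the point about specific exceptions concrete, here is a sketch of what catching only UnicodeDecodeError and keeping its details might look like. explain_decode_failure is a hypothetical helper; the attributes it reads (encoding, start, end, reason) are the standard ones carried by UnicodeDecodeError.

```python
def explain_decode_failure(raw, codec):
    """Return None if raw decodes cleanly, else a message saying what went wrong."""
    try:
        raw.decode(codec)
    except UnicodeDecodeError as err:
        # Keep the information instead of replacing it with "Exception raised."
        return "%s: bad byte %r at position %d (%s)" % (
            err.encoding, raw[err.start:err.end], err.start, err.reason)
    return None

print(explain_decode_failure(b'Bogot\xe1', 'utf-8'))      # reports the offending byte
print(explain_decode_failure(b'Bogot\xc3\xa1', 'utf-8'))  # None: valid UTF-8
```

Any other kind of exception is left to propagate, so an unexpected failure still produces a full traceback.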
Re: [Tutor] encoding question
On 2014-01-04 12:01, eryksun wrote:
> On Sat, Jan 4, 2014 at 2:26 PM, Alex Kleider wrote:
>
> b'\xe1' is Latin-1. Look in the response headers:
>
>     url = 'http://api.hostip.info/get_html.php?ip=201.234.178.62&position=true'
>
>     >>> response = urllib.request.urlopen(url)
>     >>> response.headers.get_charsets()
>     ['iso-8859-1']
>
>     >>> encoding = response.headers.get_charsets()[0]
>     >>> sp = response.read().decode(encoding).splitlines()
>     >>> sp[1]
>     'City: Bogotá'

Thank you very much. Now things are more clear.

cheers,
alex
Re: [Tutor] encoding question
On Sat, Jan 4, 2014 at 2:26 PM, Alex Kleider wrote:
> The output I get on an Ubuntu 12.04 LTS system is as follows:
> alex@x301:~/Python/Parse$ ./IP_info.py3
> Exception raised.
> IP address is 201.234.178.62:
> Country: COLOMBIA (CO); City: b'Bogot\xe1'.
> Lat/Long: 10.4/-75.2833
>
> I would have thought that utf-8 could handle the 'a-acute'.

b'\xe1' is Latin-1. Look in the response headers:

    url = 'http://api.hostip.info/get_html.php?ip=201.234.178.62&position=true'

    >>> response = urllib.request.urlopen(url)
    >>> response.headers.get_charsets()
    ['iso-8859-1']

    >>> encoding = response.headers.get_charsets()[0]
    >>> sp = response.read().decode(encoding).splitlines()
    >>> sp[1]
    'City: Bogotá'
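The header lookup can be tried without a network connection by building a message object by hand. This is a sketch under the assumption that Python 3's response.headers behaves like email.message.Message (it is a subclass of it); charset_from_content_type and the default of 'utf-8' are choices made for this example, not something the service specifies.

```python
from email.message import Message

def charset_from_content_type(value, default='utf-8'):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = value
    # get_content_charset() returns None when the header has no charset.
    return msg.get_content_charset() or default

print(charset_from_content_type('text/html; charset=ISO-8859-1'))
print(charset_from_content_type('text/plain'))  # falls back to the default
```

With a live response the same idea would be response.headers.get_content_charset() or 'utf-8', which covers the case of a missing header.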
Re: [Tutor] Encoding question
On Wed, Sep 9, 2009 at 5:06 AM, Oleg Oltar wrote:
> Hi!
>
> One of my tests returned following text ()
>
> The test:
> from django.test.client import Client
> c = Client()
> resp = c.get("/")
> resp.content
>
> In [25]: resp.content
> Out[25]: '\r\n\r\n\r\n Strict//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n\r\n
> xmlns="http://www.w3.org/1999/xhtml">\r\n \r\n
> http-equiv="content-type" content="text/html; charset=utf-8" />\r\n
> \r\n \nJapanese innovation |
> \xd0\xaf\xd0\xbf\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x8f
> \xd0\xb8\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8\n\r\n
>
> Is there a way I can convert it to normal readable text? (I need for example
> to find a string of text in this response to check if my test case Pass or
> failed)

resp.content.decode('string_escape') will convert it to encoded bytes. Then another decode() with the correct encoding will get you Unicode. I'm not sure what the correct encoding is for the second decode(), most likely one of 'utf-8', 'utf_16_le' or 'utf_16_be'.

Kent
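If the \xd0\xaf sequences are just how the interpreter displays real bytes (rather than literal backslashes in the data), a single decode is enough, and the page's own meta tag suggests UTF-8. The sketch below is written for Python 3, where 'string_escape' no longer exists; the bytes are copied from the title fragment quoted above, and page_contains is a hypothetical test helper.

```python
# Title fragment from the quoted response; the rest of the page is omitted.
content = (b'Japanese innovation | '
           b'\xd0\xaf\xd0\xbf\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x8f '
           b'\xd0\xb8\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8')

text = content.decode('utf-8')   # charset taken from the page's meta tag
print(text)

def page_contains(raw, needle, encoding='utf-8'):
    """Decode the response once, then do an ordinary substring check."""
    return needle in raw.decode(encoding)

print(page_contains(content, 'Япония'))
```

In a test case the check then becomes a one-line assertion on page_contains(resp.content, expected_text).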