Following my previous email...

On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:

> Any suggestions as to a better way to handle the problem of encoding in
> the following context would be appreciated. The problem arose because
> 'Bogota' is spelt with an acute accent on the 'a'.
Eryksun has given the right answer for how to extract the encoding from
the webpage's headers. That will help 9 times out of 10. But
unfortunately sometimes webpages will lack an encoding header, or they
will lie, or the text will be invalid for that encoding. What to do
then?

Let's start by factoring out the repeated code in your giant for-loop
into something more manageable and maintainable:

> sp = response.splitlines()
> country = city = lat = lon = ip = ''
> for item in sp:
>     if item.startswith(b"Country:"):
>         try:
>             country = item[9:].decode('utf-8')
>         except:
>             print("Exception raised.")
>             country = item[9:]
>     elif item.startswith(b"City:"):
>         try:
>             city = item[6:].decode('utf-8')
>         except:
>             print("Exception raised.")
>             city = item[6:]

and so on, becomes:

encoding = ...  # as per Eryksun's email
sp = response.splitlines()
country = city = lat = lon = ip = ''
for item in sp:
    key, value = item.split(b':', 1)
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key == 'Country':
        country = value
    elif key == 'City':
        city = value
    elif key == 'Latitude':
        lat = value
    elif key == 'Longitude':
        lon = value
    elif key == 'IP':
        ip = value
    else:
        raise ValueError('unknown key "%s" found' % key)
return {"Country": country, "City": city,
        "Lat": lat, "Long": lon, "IP": ip}

(Note that since the items are bytes, we have to split on b':', not
':'.)

But we can do better than that!

encoding = ...  # as per Eryksun's email
sp = response.splitlines()
record = {"Country": None, "City": None, "Latitude": None,
          "Longitude": None, "IP": None}
for item in sp:
    key, value = item.split(b':', 1)
    key = key.decode(encoding).strip()
    value = value.decode(encoding).strip()
    if key in record:
        record[key] = value
    else:
        raise ValueError('unknown key "%s" found' % key)
if None in record.values():
    for key, value in record.items():
        if value is None:
            break
    raise ValueError('missing key in record: %s' % key)
return record

This simplifies the code a lot, and adds some error-handling.
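To see the dict-based version in action, here is a minimal,
self-contained sketch with the loop wrapped in a function. The function
name parse_response and the sample response bytes are my own inventions
for illustration, not part of Alex's actual data:

```python
def parse_response(response, encoding='utf-8'):
    """Parse b'Key: value' lines into a record dict."""
    record = {"Country": None, "City": None, "Latitude": None,
              "Longitude": None, "IP": None}
    for item in response.splitlines():
        # The response is bytes, so split on a bytes separator.
        key, value = item.split(b':', 1)
        key = key.decode(encoding).strip()
        value = value.decode(encoding).strip()
        if key in record:
            record[key] = value
        else:
            raise ValueError('unknown key "%s" found' % key)
    if None in record.values():
        # Find the first missing key so we can name it in the error.
        for key, value in record.items():
            if value is None:
                break
        raise ValueError('missing key in record: %s' % key)
    return record

# Invented sample data, encoded as UTF-8 (note the accented 'a'):
sample = ('Country: Colombia\n'
          'City: Bogot\u00e1\n'
          'Latitude: 4.6\n'
          'Longitude: -74.08\n'
          'IP: 190.0.0.1').encode('utf-8')

print(parse_response(sample))
```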
It may be appropriate for your application to handle missing keys by
using some default value, such as an empty string, or some other value
that cannot be mistaken for an actual value, say "*missing*". But since
I don't know your application's needs, I'm going to leave that up to
you. Better to start strict and loosen up later, than start too loose
and never realise that errors are occurring.

I've also changed the keys "Lat" and "Long" to "Latitude" and
"Longitude". If that's a problem, it's easy to fix. Just before
returning the record, change the key:

record['Lat'] = record.pop('Latitude')

and similar for Longitude.

Now that the code is simpler to read and maintain, we can start dealing
with the risk that the encoding will be missing or wrong. A missing
encoding is easy to handle: just pick a default encoding, and hope it
is the right one. UTF-8 is a good choice. (It's the only *correct*
choice, everybody should be using UTF-8, but alas they often don't.) So
modify Eryksun's code snippet to return 'UTF-8' if the header is
missing, and you should be good.

How to deal with incorrect encodings? That can happen when the website
creator *thinks* they are using a certain encoding, but somehow invalid
bytes for that encoding creep into the data. That gives us a few
different strategies:

(1) The third-party "chardet" module can analyse text and try to guess
what encoding it *actually* is, rather than what encoding it claims to
be. This is what Firefox and other web browsers do, because there are
an awful lot of shitty websites out there. But it's not foolproof, so
even if it guesses correctly, you still have to deal with invalid data.

(2) By default, the decode method will raise an exception. You can
catch the exception and try again with a different encoding:

for codec in (encoding, 'utf-8', 'latin-1'):
    try:
        key = key.decode(codec)
    except UnicodeDecodeError:
        pass
    else:
        break

Latin-1 should be last, because it has the nice property that it will
*always* succeed.
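Strategy (2) above is easy to package up as a helper. Here is a sketch;
the function name decode_with_fallback is mine, not something from the
thread:

```python
def decode_with_fallback(raw, declared_encoding):
    """Decode bytes, trying the declared encoding first, then UTF-8,
    and finally Latin-1 (which never fails)."""
    for codec in (declared_encoding, 'utf-8', 'latin-1'):
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            pass
    # We can't get here: Latin-1 maps every possible byte to a
    # character, so the last attempt always succeeds.

# These bytes are valid Latin-1 but invalid UTF-8, so the first two
# attempts fail and the Latin-1 fallback kicks in:
print(decode_with_fallback(b'Bogot\xe1', 'utf-8'))
```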
That doesn't mean it will give you the right characters, as intended by
the person who wrote the website, just that it will always give you
*some* characters. They may be completely wrong, in other words
"mojibake", but they'll be something. An example of mojibake:

py> b = 'Bogotá'.encode('utf-8')
py> b.decode('latin-1')
'BogotÃ¡'

Perhaps a better way is to use the decode/encode error handler. Instead
of just calling the decode method, you can specify what to do when an
error occurs: raise an exception, ignore the bad bytes, or replace them
with some sort of placeholder. We can see the difference here:

py> b = 'Bogotá'.encode('latin-1')
py> print(b)
b'Bogot\xe1'
py> b.decode('utf-8', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5:
unexpected end of data
py> b.decode('utf-8', 'ignore')
'Bogot'
py> b.decode('utf-8', 'replace')
'Bogot�'

My suggestion is to use the 'replace' error handler.

Armed with this, you should be able to write good solid code that can
handle most encoding-related errors.

-- 
Steven

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor