Re: [Tutor] encoding question
On 01/05/2014 12:52 AM, Steven D'Aprano wrote:
> If you don't understand an exception, you have no business covering it up and hiding that it took place. Never use a bare try...except, always catch the *smallest* number of specific exception types that make sense. Better is to avoid catching exceptions at all: an exception (usually) means something has gone wrong. You should aim to fix the problem *before* it blows up, not after.
>
> I'm reminded of a quote:
>
> "I find it amusing when novice programmers believe their main job is preventing programs from crashing. ... More experienced programmers realize that correct code is great, code that crashes could use improvement, but incorrect code that doesn't crash is a horrible nightmare." -- Chris Smith
>
> Your code is incorrect, it does the wrong thing, but it doesn't crash, it just covers up the fact that an exception occurred.

An exception, or any other kind of anomaly detected by a func one calls, is in most cases a *symptom* of an error somewhere else in one's code (possibly far away in the source, possibly much earlier in the run, possibly apparently unrelated). Catching an exception (except in rare cases) just suppresses a _signal_ about a probable error. Catching an exception does not make the code correct, it just pretends to (except in rare cases). It's like hiding the dirt under a carpet, or beating up the poor guy who ran 3 kilometers to tell you a fire is threatening your home.

Again: the anomaly (eg wrong input) detected by a func is not the error; it is a consequence of the true original error, which is what one should aim at correcting. (But our culture apparently loves repressing symptoms rather than curing actual problems: we programmers often just thoughtlessly apply the same scheme ;-)

We should instead gratefully thank func authors for having correctly done their job of controlling input. They offer us the information needed to find bugs which otherwise might happily live on undetected, and thus the opportunity to write more correct software. (This is why func authors should control input, refuse any anomalous or dubious values, and never ever try to guess what the app expects in such cases; instead just say "cannot do my job safely, or at all".)

If one is passing an empty set to an 'average' func, don't blame the func or shut up the signal/exception; instead be grateful to the func's author, and find out why and how it happens that the set is empty. If one is trying to write into a file, don't blame the file for not existing, or the user for being stupid, and don't shut up the signal/exception; instead be grateful to the func's author, and find out why and how it happens that the file does not exist, now (about the user: is your doc clear enough?).

The sub-category of cases where exception handling makes sense at all is the following:
* a called function may fail (eg average, find a given item in a list, write into a file)
* and, the failure case makes sense for the app, it _does_ belong to the app logic
* and, the case should nevertheless be handled like the others up to this point in code (meaning, there should not be a separate branch for it, we should really land there in code even for this failure case)
* and, one cannot know whether it is a failure case without trying, or checking would be as costly as just trying (wrong for average, right for the 2 other examples)
* and, one can repair the failure right here, in any case, and go on correctly according to the app logic (depends on apps)
(there is also the category of alternate running modes)

In such a situation, the right thing to do is to catch the exception signal (or use whatever error management exists, eg a check for a None return value) and proceed correctly (and think of testing this case ;-). But this is not that common. In particular, if the failure case does not belong to the app logic (the item should be there, the file should exist) then do *not* catch a potential signal: if it happens, it tells you about a bug *elsewhere* in code; and _this_ is what is to be corrected.

There is a mythology in programming that software should not crash; wrongly understood (or rightly, since authors of such texts are usually pretty unclear and ambiguous), this leads to catching exceptions that are just signals of symptoms of errors... Instead, software should crash whenever it is incorrect; often (when the error does not cause obvious misbehaviour) it is the only way for the programmer to know about errors. Crashes are the programmer's best friend (I mean, for those programmers whose aim is to write quality software).

Denis
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
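The distinction drawn above can be sketched roughly as follows. This is only an illustration under stated assumptions: `average` and `read_config` are hypothetical helpers, not code from the thread. The first refuses anomalous input and lets the caller see the signal; the second handles one specific failure that is assumed to belong to the app logic (an optional file that may legitimately be absent) and repairs it on the spot.

```python
import os
import tempfile


def average(values):
    """Hypothetical helper: refuse anomalous input instead of guessing."""
    values = list(values)
    if not values:
        raise ValueError("average() of an empty collection is undefined")
    return sum(values) / len(values)


def read_config(path, default=""):
    """Hypothetical helper: here a missing file is assumed to be a normal case."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        # Only this one expected failure is handled; anything else propagates.
        return default


# A path inside a fresh temporary directory, so the file cannot exist yet.
missing = os.path.join(tempfile.mkdtemp(), "missing.cfg")
print(read_config(missing, default="use defaults"))
print(average([1, 2, 3]))
```

Calling `average([])` is deliberately left to raise: per the argument above, an empty input there points at a bug elsewhere.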
Re: [Tutor] encoding question
On 01/04/2014 08:26 PM, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in the following context would be appreciated. The problem arose because 'Bogota' is spelt with an acute accent on the 'a'.
>
> $ cat IP_info.py3
> #!/usr/bin/env python3
> # -*- coding : utf -8 -*-
> # file: 'IP_info.py3' a module.
>
> import urllib.request
>
> url_format_str = \
>     'http://api.hostip.info/get_html.php?ip=%s&position=true'
>
> def ip_info(ip_address):
>     """
>     Returns a dictionary keyed by Country, City, Lat, Long and IP.
>
>     Depends on http://api.hostip.info (which returns the following:
>     'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude:
>     38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.)
>     THIS COULD BREAK IF THE WEB SITE GOES AWAY!!!
>     """
>     response = urllib.request.urlopen(url_format_str %\
>                                       (ip_address, )).read()
>     sp = response.splitlines()
>     country = city = lat = lon = ip = ''
>     for item in sp:
>         if item.startswith(b"Country:"):
>             try:
>                 country = item[9:].decode('utf-8')
>             except:
>                 print("Exception raised.")
>                 country = item[9:]
>         elif item.startswith(b"City:"):
>             try:
>                 city = item[6:].decode('utf-8')
>             except:
>                 print("Exception raised.")
>                 city = item[6:]
>         elif item.startswith(b"Latitude:"):
>             try:
>                 lat = item[10:].decode('utf-8')
>             except:
>                 print("Exception raised.")
>                 lat = item[10:]
>         elif item.startswith(b"Longitude:"):
>             try:
>                 lon = item[11:].decode('utf-8')
>             except:
>                 print("Exception raised.")
>                 lon = item[11:]
>         elif item.startswith(b"IP:"):
>             try:
>                 ip = item[4:].decode('utf-8')
>             except:
>                 print("Exception raised.")
>                 ip = item[4:]
>     return {"Country" : country,
>             "City" : city,
>             "Lat" : lat,
>             "Long" : lon,
>             "IP" : ip}
>
> if __name__ == "__main__":
>     addr = "201.234.178.62"
>     print ("""    IP address is %(IP)s:
>         Country: %(Country)s;  City: %(City)s.
>         Lat/Long: %(Lat)s/%(Long)s""" % ip_info(addr))
>
> The output I get on an Ubuntu 12.04 LTS system is as follows:
>
> alex@x301:~/Python/Parse$ ./IP_info.py3
> Exception raised.
>     IP address is 201.234.178.62:
>         Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
>         Lat/Long: 10.4/-75.2833
>
> I would have thought that utf-8 could handle the 'a-acute'.
>
> Thanks,
> alex

'á' is not encoded as the single byte 0xe1 in UTF-8; what you are reading is probably (legacy) data in latin-1 (or another latin-* encoding).

Denis
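A minimal sketch of the point about the byte 0xe1, using only the bytes that appear in the output above. The character is written with an escape (`\u00e1`) so the example does not depend on the source file's own encoding, and `ascii()` is used for printing so it does not depend on the terminal's.

```python
raw = b"Bogot\xe1"  # the bytes shown in the program output above

# Latin-1 maps each byte to the code point with the same number,
# so 0xe1 becomes U+00E1 (a with acute).
as_latin1 = raw.decode("latin-1")
print(ascii(as_latin1))             # 'Bogot\xe1'

# UTF-8 needs two bytes for that same code point.
as_utf8_bytes = as_latin1.encode("utf-8")
print(as_utf8_bytes)                # b'Bogot\xc3\xa1'

# The original bytes are therefore not valid UTF-8.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)                   # False
```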
Re: [Tutor] encoding question
On 01/05/2014 03:31 AM, Alex Kleider wrote:
> I've been maintaining both a Python3 and a Python2.7 version. The latter has actually opened my eyes to more complexities. Specifically the need to use unicode strings rather than Python2.7's default ascii.

So-called Unicode strings are not the solution to all problems. Take your 'á' as an example: it can be represented either by 1 precomposed code (unicode code point) 0xe1, or by 2 codes (one for the base 'a', one for the combining '´'). Imagine you search for "Bogotá": how do you know which representation is used in the text you search? How do you know at all that there are multiple representations, and what they are?

The routine will work iff, by chance, your *programming editor* (!) used the same representation as the software used to create the searched text... Usually this is the case, because most text-creation software uses precomposed codes, when they exist, for composite characters. (But this fact just makes the issue rarer, harder to be aware of, and thus more difficult to cope with correctly in code. As far as I know nearly no software does it.)

Denis
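The two representations described above can be made concrete with the standard `unicodedata` module. This is a small sketch, not something proposed in the thread; `same_text` is a hypothetical helper name, and normalizing both sides to NFC is one common way (not the only one) to make such a search insensitive to the representation.

```python
import unicodedata

precomposed = "Bogot\u00e1"    # 'a with acute' as a single code point
decomposed = "Bogota\u0301"    # 'a' followed by COMBINING ACUTE ACCENT

# They render the same but compare unequal and have different lengths.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 6 7


def same_text(a, b):
    """Hypothetical helper: compare after normalizing both strings to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_text(precomposed, decomposed))   # True
```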
Re: [Tutor] encoding question
On 01/05/2014 08:57 AM, Alex Kleider wrote:
> On 2014-01-04 21:20, Danny Yoo wrote:
>> Oh! That's unfortunate! That looks like a bug on the hostip.info side. Check with them about it.
>>
>> I can't get the source code to whatever is implementing the JSON response, so I can not say why the city is not being properly included there.
>>
>> [... XML rant about to start. I am not disinterested, so my apologies in advance.]
>>
>> ... In that case... I suppose trying the XML output is a possible approach.
>
> Well, I've tried the xml approach which seems promising but still I get an encoding related error.

Note that the (computing) data description format (JSON, XML...) and the textual format, or encoding (Unicode utf8/16/32, legacy iso-8859-* also called latin-*, ...) are more or less unrelated and independent. Changing the data description format cannot solve a text encoding issue (but it may hide it, if by chance the new data description format happens to use the text encoding you happen to use when reading, implicitly or explicitly).

Denis
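To illustrate that the data format and the text encoding are separate layers, here is a small sketch with made-up data (it is not the hostip.info response): the same JSON document is serialized to bytes with two different encodings, and parsing succeeds either way, but the text only comes out right when the bytes are decoded with the encoding that was actually used.

```python
import json

# One JSON document (the data description format)...
text = json.dumps({"City": "Bogot\u00e1"}, ensure_ascii=False)

# ...written out with two different text encodings.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")
print(as_latin1 == as_utf8)    # False: same data, different bytes

# Decoding with the right encoding gives the same data either way.
right = json.loads(as_utf8.decode("utf-8"))["City"]

# Decoding with the wrong one still parses as JSON, but the text is garbled.
wrong = json.loads(as_utf8.decode("latin-1"))["City"]
print(ascii(right), ascii(wrong))
```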
Re: [Tutor] encoding question
On 05/01/2014 02:31, Alex Kleider wrote:
> I've been maintaining both a Python3 and a Python2.7 version. The latter has actually opened my eyes to more complexities. Specifically the need to use unicode strings rather than Python2.7's default ascii.

This might help: http://python-future.org/

--
My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language.

Mark Lawrence
Re: [Tutor] encoding question
On Sat, Jan 04, 2014 at 11:57:20PM -0800, Alex Kleider wrote:
> Well, I've tried the xml approach which seems promising but still I get an encoding related error. Is there a bug in the xml.etree module (not very likely, me thinks) or am I doing something wrong?

I'm no expert on XML, but it looks to me like it is a bug in ElementTree. It doesn't appear to handle unicode strings correctly (although perhaps it doesn't promise to).

A simple demonstration using Python 2.7:

    py> import xml.etree.ElementTree as ET
    py> ET.fromstring(u'<xml>a</xml>')
    <Element 'xml' at 0xb7ca982c>

But:

    py> ET.fromstring(u'<xml>á</xml>')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1282, in XML
        parser.feed(text)
      File "/usr/local/lib/python2.7/xml/etree/ElementTree.py", line 1622, in feed
        self._parser.Parse(data, 0)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 5: ordinal not in range(128)

An easy work-around:

    py> ET.fromstring(u'<xml>á</xml>'.encode('utf-8'))
    <Element 'xml' at 0xb7ca9a8c>

although, as I said, I'm no expert on XML and this may lead to errors later on.

> There's no denying that the whole encoding issue is still not completely clear to me in spite of having devoted a lot of time to trying to grasp all that's involved.

Have you read Joel On Software's explanation?

http://www.joelonsoftware.com/articles/Unicode.html

It's well worth reading. Start with that, and then ask if you have any further questions.

> Here's what I've got:
>
> alex@x301:~/Python/Parse$ cat ip_xml.py
> #!/usr/bin/env python
> # -*- coding : utf -8 -*-
> # file: 'ip_xml.py'
> [...]
>     tree = ET.fromstring(xml)
>     root = tree.getroot()  # Here's where it blows up!!!

I reckon that what you need is to change the first line to:

    tree = ET.fromstring(xml.encode('latin-1'))

or whatever the encoding is meant to be.

--
Steven
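For comparison with the Python 2.7 session above, a minimal sketch in Python 3 of handing ElementTree the raw bytes and letting it honour the encoding named in the XML declaration. The document below is made up for illustration and is not the hostip.info response.

```python
import xml.etree.ElementTree as ET

# Bytes as they might arrive from the network, encoded as Latin-1 and
# saying so in the XML declaration.
document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<city>Bogot\u00e1</city>'
).encode("latin-1")

# No manual decode: the parser reads the declaration and decodes itself.
root = ET.fromstring(document)
print(ascii(root.text))    # 'Bogot\xe1'
```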
Re: [Tutor] encoding question
On Sun, Jan 5, 2014 at 2:57 AM, Alex Kleider <aklei...@sonic.net> wrote:
> def ip_info(ip_address):
>     response = urllib2.urlopen(url_format_str %\
>                                (ip_address, ))
>     encoding = response.headers.getparam('charset')
>     print "'encoding' is '%s'." % (encoding, )
>     info = unicode(response.read().decode(encoding))

decode() returns a unicode object.

>     n = info.find('\n')
>     print "location of first newline is %s." % (n, )
>     xml = info[n+1:]
>     print "'xml' is '%s'." % (xml, )
>
>     tree = ET.fromstring(xml)
>     root = tree.getroot()  # Here's where it blows up!!!
>     print "'root' is '%s', with the following children:" % (root, )
>     for child in root:
>         print child.tag, child.attrib
>     print "END of CHILDREN"
>     return info

Danny walked you through the XML. Note that he didn't decode the response. It includes an encoding on the first line:

    <?xml version="1.0" encoding="ISO-8859-1" ?>

Leave it to ElementTree. Here's something to get you started:

    import urllib2
    import xml.etree.ElementTree as ET
    import collections

    url_format_str = 'http://api.hostip.info/?ip=%s&position=true'
    GML = 'http://www.opengis.net/gml'
    IPInfo = collections.namedtuple('IPInfo', '''
        ip
        city
        country
        latitude
        longitude
    ''')

    def ip_info(ip_address):
        response = urllib2.urlopen(url_format_str %
                                   ip_address)
        tree = ET.fromstring(response.read())
        hostip = tree.find('{%s}featureMember/Hostip' % GML)
        ip = hostip.find('ip').text
        city = hostip.find('{%s}name' % GML).text
        country = hostip.find('countryName').text
        coord = hostip.find('.//{%s}coordinates' % GML).text
        lon, lat = coord.split(',')
        return IPInfo(ip, city, country, lat, lon)

    >>> info = ip_info('201.234.178.62')
    >>> info.ip
    '201.234.178.62'
    >>> info.city, info.country
    (u'Bogot\xe1', 'COLOMBIA')
    >>> info.latitude, info.longitude
    ('10.4', '-75.2833')

This assumes everything works perfectly. You have to decide how to fail gracefully for the service being unavailable or malformed XML (incomplete or corrupted response, etc).
Re: [Tutor] encoding question
On 2014-01-05 08:02, eryksun wrote:
> Danny walked you through the XML. Note that he didn't decode the response. [...] Leave it to ElementTree. Here's something to get you started:
>
> [...]
>
> This assumes everything works perfect. You have to decide how to fail gracefully for the service being unavailable or malformed XML (incomplete or corrupted response, etc).

Thanks again for the input. You're using some ET syntax there that would probably make my code much more readable but will require a bit more study on my part.

I was up all night trying to get this sorted out and was finally successful. (Re-)Reading 'joelonsoftware' and some of the Python docs helped. Here's what I came up with (still needs modification to return a dictionary, but that'll be trivial.)

    alex@x301:~/Python/Parse$ cat ip_xml.py
    #!/usr/bin/env python
    # vim: set fileencoding=utf-8 :
    # -*- coding : utf-8 -*-
    # file: 'ip_xml.py'

    import urllib2
    import xml.etree.ElementTree as ET

    url_format_str = \
        u'http://api.hostip.info/?ip=%s&position=true'

    def ip_info(ip_address):
        response = urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
        encoding = response.headers.getparam('charset')
        info = response.read().decode(encoding)
        # info comes in as type 'unicode'.
        n = info.find('\n')
        xml = info[n+1:]  # Get rid of a header line.
        # root = ET.fromstring(xml) # This causes error:
        # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
        # in position 456: ordinal not in range(128)
        root = ET.fromstring(xml.encode("utf-8"))
        # This is the part I still don't fully understand but would
        # probably have to look at the library source to do so.
        info = []
        for i in range(4):
            info.append(root[3][0][i].text)
        info.append(root[3][0][4][0][0][0].text)
        return info

    if __name__ == "__main__":
        info = ip_info("201.234.178.62")
        print info
        print info[1]

    alex@x301:~/Python/Parse$ ./ip_xml.py
    ['201.234.178.62', u'Bogot\xe1', 'COLOMBIA', 'CO', '-75.2833,10.4']
    Bogotá

Thanks to all who helped.
ak
Re: [Tutor] encoding question
On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
> Danny walked you through the XML. Note that he didn't decode the response. It includes an encoding on the first line:
>
>     <?xml version="1.0" encoding="ISO-8859-1" ?>

That surprises me. I thought XML was only valid in UTF-8? Or maybe that was wishful thinking.

>     tree = ET.fromstring(response.read())

In other words, leave it to ElementTree to manage the decoding and encoding itself. Nice -- I like that solution.

--
Steven
Re: [Tutor] encoding question
On 2014-01-05 14:26, Steven D'Aprano wrote:
> On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
>> Danny walked you through the XML. Note that he didn't decode the response. It includes an encoding on the first line:
>>
>>     <?xml version="1.0" encoding="ISO-8859-1" ?>
>
> That surprises me. I thought XML was only valid in UTF-8? Or maybe that was wishful thinking.
>
>>     tree = ET.fromstring(response.read())

I believe you were correct the first time. My experience with all of this has been that, in spite of the xml having been advertised as encoded in ISO-8859-1 (which I believe is synonymous with Latin-1), my script (specifically Python's xml parser, xml.etree.ElementTree) didn't work until the xml was decoded from Latin-1 (into Unicode) and then encoded into UTF-8. Here's the snippet with some comments mentioning the painful lessons learned:

    response = urllib2.urlopen(url_format_str %\
                               (ip_address, ))
    encoding = response.headers.getparam('charset')
    info = response.read().decode(encoding)
    # info comes in as type 'unicode'.
    n = info.find('\n')
    xml = info[n+1:]  # Get rid of a header line.
    # root = ET.fromstring(xml) # This causes error:
    # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1'
    # in position 456: ordinal not in range(128)
    root = ET.fromstring(xml.encode("utf-8"))

> In other words, leave it to ElementTree to manage the decoding and encoding itself. Nice -- I like that solution.
Re: [Tutor] encoding question
On Sun, Jan 5, 2014 at 5:26 PM, Steven D'Aprano <st...@pearwood.info> wrote:
> On Sun, Jan 05, 2014 at 11:02:34AM -0500, eryksun wrote:
>>
>>     <?xml version="1.0" encoding="ISO-8859-1" ?>
>
> That surprises me. I thought XML was only valid in UTF-8? Or maybe that was wishful thinking.

JSON text SHALL be encoded in Unicode:

https://tools.ietf.org/html/rfc4627#section-3

For XML, UTF-8 is recommended by RFC 3023, but not required. Also, the MIME charset takes precedence. Section 8 has examples:

https://tools.ietf.org/html/rfc3023#section-8

So I was technically wrong to rely on the XML encoding (they happen to be the same in this case). Instead you can create a parser with the encoding from the header:

    encoding = response.headers.getparam('charset')
    parser = ET.XMLParser(encoding=encoding)
    tree = ET.parse(response, parser)

The expat parser (pyexpat) used by Python is limited to ASCII, Latin-1 and Unicode transport encodings. So it's probably better to transcode to UTF-8 as Alex is doing, but then use a custom parser to override the XML encoding:

    encoding = response.headers.getparam('charset')
    info = response.read().decode(encoding).encode('utf-8')
    parser = ET.XMLParser(encoding='utf-8')
    tree = ET.fromstring(info, parser)
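The override described above can be sketched in Python 3 as follows. The document is made up for illustration: its declaration claims ISO-8859-1 while the bytes are really UTF-8, which is the kind of disagreement between the XML declaration and the transport that the message is about. Passing `encoding` to `ET.XMLParser` makes the parser use that encoding instead of the declared one.

```python
import xml.etree.ElementTree as ET

# The declaration says ISO-8859-1, but the payload was transcoded to UTF-8.
payload = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<city>Bogot\u00e1</city>'
).encode("utf-8")

# Tell the parser which encoding the bytes are really in.
parser = ET.XMLParser(encoding="utf-8")
root = ET.fromstring(payload, parser=parser)
print(ascii(root.text))    # 'Bogot\xe1'
```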
Re: [Tutor] encoding question
On Sat, Jan 4, 2014 at 2:26 PM, Alex Kleider <aklei...@sonic.net> wrote:
> The output I get on an Ubuntu 12.04 LTS system is as follows:
>
> alex@x301:~/Python/Parse$ ./IP_info.py3
> Exception raised.
>     IP address is 201.234.178.62:
>         Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
>         Lat/Long: 10.4/-75.2833
>
> I would have thought that utf-8 could handle the 'a-acute'.

b'\xe1' is Latin-1. Look in the response headers:

    >>> url = 'http://api.hostip.info/get_html.php?ip=201.234.178.62&position=true'
    >>> response = urllib.request.urlopen(url)
    >>> response.headers.get_charsets()
    ['iso-8859-1']

    >>> encoding = response.headers.get_charsets()[0]
    >>> sp = response.read().decode(encoding).splitlines()
    >>> sp[1]
    'City: Bogotá'
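The header lookup above needs a live connection, so here is an offline sketch of the same idea. In Python 3 the `headers` attribute of a urllib response is an `email.message.Message` subclass; the example builds such a message by hand with an assumed `Content-Type` value, and the UTF-8 fallback for a missing charset is an assumption of this sketch rather than something the service promises.

```python
from email.message import Message

# Stand-in for response.headers, with an assumed Content-Type value.
headers = Message()
headers["Content-Type"] = "text/html; charset=ISO-8859-1"

print(headers.get_charsets())    # ['iso-8859-1']

# get_content_charset() returns None when no charset is given,
# so a default can be supplied for that case.
encoding = headers.get_content_charset() or "utf-8"

body = b"Country: COLOMBIA (CO)\nCity: Bogot\xe1\n"
lines = body.decode(encoding).splitlines()
print(ascii(lines[1]))           # 'City: Bogot\xe1'
```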
Re: [Tutor] encoding question
On 2014-01-04 12:01, eryksun wrote:
> On Sat, Jan 4, 2014 at 2:26 PM, Alex Kleider <aklei...@sonic.net> wrote:
>> .
>
> b'\xe1' is Latin-1. Look in the response headers:
>
>     >>> url = 'http://api.hostip.info/get_html.php?ip=201.234.178.62&position=true'
>     >>> response = urllib.request.urlopen(url)
>     >>> response.headers.get_charsets()
>     ['iso-8859-1']
>
>     >>> encoding = response.headers.get_charsets()[0]
>     >>> sp = response.read().decode(encoding).splitlines()
>     >>> sp[1]
>     'City: Bogotá'

Thank you very much. Now things are more clear.

cheers,
alex
Re: [Tutor] encoding question
On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in the following context would be appreciated.

Python gives you lots of useful information when errors occur, but unfortunately your code throws that information away and replaces it with a totally useless message:

>     try:
>         country = item[9:].decode('utf-8')
>     except:
>         print("Exception raised.")

Oh great. An exception was raised. What sort of exception? What error message did it have? Why did it happen? Nobody knows, because you throw it away.

Never, never, never do this. If you don't understand an exception, you have no business covering it up and hiding that it took place. Never use a bare try...except, always catch the *smallest* number of specific exception types that make sense. Better is to avoid catching exceptions at all: an exception (usually) means something has gone wrong. You should aim to fix the problem *before* it blows up, not after.

I'm reminded of a quote:

"I find it amusing when novice programmers believe their main job is preventing programs from crashing. ... More experienced programmers realize that correct code is great, code that crashes could use improvement, but incorrect code that doesn't crash is a horrible nightmare." -- Chris Smith

Your code is incorrect, it does the wrong thing, but it doesn't crash, it just covers up the fact that an exception occurred.

> The output I get on an Ubuntu 12.04 LTS system is as follows:
>
> alex@x301:~/Python/Parse$ ./IP_info.py3
> Exception raised.
>     IP address is 201.234.178.62:
>         Country: COLOMBIA (CO);  City: b'Bogot\xe1'.
>         Lat/Long: 10.4/-75.2833
>
> I would have thought that utf-8 could handle the 'a-acute'.

Of course it can:

    py> 'Bogotá'.encode('utf-8')
    b'Bogot\xc3\xa1'

    py> b'Bogot\xc3\xa1'.decode('utf-8')
    'Bogotá'

But you don't have UTF-8. You have something else, and trying to decode it using UTF-8 fails.

    py> b'Bogot\xe1'.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: unexpected end of data

More to follow...

--
Steven
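As a rough sketch of the advice above, here is one way to catch only the specific exception and keep the information it carries instead of printing a generic message. `decode_field` is a hypothetical helper name, and re-raising as `ValueError` with the details attached is a design choice of this sketch; simply letting the `UnicodeDecodeError` propagate would serve the same purpose.

```python
def decode_field(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes, reporting exactly what went wrong."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # The exception object says which bytes failed, where, and why.
        bad = exc.object[exc.start:exc.end]
        raise ValueError(
            "cannot decode %r as %s: byte(s) %r at position %d (%s)"
            % (raw, encoding, bad, exc.start, exc.reason)
        ) from exc


print(ascii(decode_field(b"Bogot\xc3\xa1")))    # 'Bogot\xe1'

try:
    decode_field(b"Bogot\xe1")
except ValueError as err:
    message = str(err)
    print(message)
```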
Re: [Tutor] encoding question
On 2014-01-04 15:52, Steven D'Aprano wrote:
> Oh great. An exception was raised. What sort of exception? What error message did it have? Why did it happen? Nobody knows, because you throw it away.
>
> Never, never, never do this. If you don't understand an exception, you have no business covering it up and hiding that it took place. [...]
>
> Your code is incorrect, it does the wrong thing, but it doesn't crash, it just covers up the fact that an exception occured.
>
>> I would have thought that utf-8 could handle the 'a-acute'.
>
> Of course it can:
>
>     py> 'Bogotá'.encode('utf-8')

I'm interested in knowing how you were able to enter the above line (assuming you have a keyboard similar to mine.)

>     b'Bogot\xc3\xa1'
>
>     py> b'Bogot\xc3\xa1'.decode('utf-8')
>     'Bogotá'
>
> But you don't have UTF-8. You have something else, and trying to decode it using UTF-8 fails.
>
>     py> b'Bogot\xe1'.decode('utf-8')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>     UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 5: unexpected end of data
>
> More to follow...

I very much agree with your remarks. In a pathetic attempt at self defence I just want to mention that what I presented wasn't what I thought was a solution. Rather it was an attempt to figure out what the problem was, as a preliminary step to fixing it. With help from you and others, I was successful in doing this. And for that help, I thank all list participants very much.
Re: [Tutor] encoding question
On Sat, Jan 4, 2014 at 7:15 PM, Alex Kleider <aklei...@sonic.net> wrote:
>>     py> 'Bogotá'.encode('utf-8')
>
> I'm interested in knowing how you were able to enter the above line (assuming you have a key board similar to mine.)

I use an international keyboard layout:

https://en.wikipedia.org/wiki/QWERTY#US-International

One could also copy and paste from a printed literal:

    >>> 'Bogot\xe1'
    'Bogotá'

Or more verbosely:

    >>> 'Bogot\N{latin small letter a with acute}'
    'Bogotá'
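The escape-based spellings above, plus the four-digit `\u` form, can be checked against each other in Python 3; this short sketch only confirms that they all denote the same string and looks up the character's official name.

```python
import unicodedata

by_hex = "Bogot\xe1"
by_code_point = "Bogot\u00e1"
by_name = "Bogot\N{LATIN SMALL LETTER A WITH ACUTE}"

# All three literals denote the same six-character string.
print(by_hex == by_code_point == by_name)    # True

# Going the other way: look up the name of the last character.
print(unicodedata.name(by_hex[-1]))          # LATIN SMALL LETTER A WITH ACUTE
```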
Re: [Tutor] encoding question
Following my previous email...

On Sat, Jan 04, 2014 at 11:26:35AM -0800, Alex Kleider wrote:
> Any suggestions as to a better way to handle the problem of encoding in the following context would be appreciated. The problem arose because 'Bogota' is spelt with an acute accent on the 'a'.

Eryksun has given the right answer for how to extract the encoding from the webpage's headers. That will help 9 times out of 10. But unfortunately sometimes webpages will lack an encoding header, or they will lie, or the text will be invalid for that encoding. What to do then?

Let's start by factoring out the repeated code in your giant for-loop into something more manageable and maintainable:

>     sp = response.splitlines()
>     country = city = lat = lon = ip = ''
>     for item in sp:
>         if item.startswith(b"Country:"):
>             try:
>                 country = item[9:].decode('utf-8')
>             except:
>                 print("Exception raised.")
>                 country = item[9:]
>         elif item.startswith(b"City:"):
>             try:
>                 city = item[6:].decode('utf-8')
>             except:
>                 print("Exception raised.")
>                 city = item[6:]

and so on, becomes:

    encoding = ...  # as per Eryksun's email
    sp = response.splitlines()
    country = city = lat = lon = ip = ''
    for item in sp:
        if not item.strip():
            continue  # the response contains a blank line
        key, value = item.split(b':', 1)  # item is bytes
        key = key.decode(encoding).strip()
        value = value.decode(encoding).strip()
        if key == 'Country':
            country = value
        elif key == 'City':
            city = value
        elif key == 'Latitude':
            lat = value
        elif key == 'Longitude':
            lon = value
        elif key == 'IP':
            ip = value
        else:
            raise ValueError('unknown key "%s" found' % key)
    return {"Country" : country,
            "City" : city,
            "Lat" : lat,
            "Long" : lon,
            "IP" : ip
            }

But we can do better than that!

    encoding = ...  # as per Eryksun's email
    sp = response.splitlines()
    record = {"Country": None, "City": None, "Latitude": None,
              "Longitude": None, "IP": None}
    for item in sp:
        if not item.strip():
            continue  # the response contains a blank line
        key, value = item.split(b':', 1)  # item is bytes
        key = key.decode(encoding).strip()
        value = value.decode(encoding).strip()
        if key in record:
            record[key] = value
        else:
            raise ValueError('unknown key "%s" found' % key)
    if None in list(record.values()):
        for key, value in record.items():
            if value is None: break
        raise ValueError('missing key in record: %s' % key)
    return record

This simplifies the code a lot, and adds some error-handling. It may be appropriate for your application to handle missing keys by using some default value, such as an empty string, or some other value that cannot be mistaken for an actual value, say "*missing*". But since I don't know your application's needs, I'm going to leave that up to you. Better to start strict and loosen up later, than start too loose and never realise that errors are occurring.

I've also changed the keys "Lat" and "Lon" to "Latitude" and "Longitude". If that's a problem, it's easy to fix. Just before returning the record, change the key:

    record['Lat'] = record.pop('Latitude')

and similar for Longitude.

Now that the code is simpler to read and maintain, we can start dealing with the risk that the encoding will be missing or wrong.

A missing encoding is easy to handle: just pick a default encoding, and hope it is the right one. UTF-8 is a good choice. (It's the only *correct* choice, everybody should be using UTF-8, but alas they often don't.) So modify Eryksun's code snippet to return 'UTF-8' if the header is missing, and you should be good.

How to deal with incorrect encodings? That can happen when the website creator *thinks* they are using a certain encoding, but somehow invalid bytes for that encoding creep into the data. That gives us a few different strategies:

(1) The third-party "chardet" module can analyse text and try to guess what encoding it *actually* is, rather than what encoding it claims to be. This is what Firefox and other web browsers do, because there are an awful lot of shitty websites out there. But it's not foolproof, so even if it guesses correctly, you still have to deal with invalid data.

(2) By default, the decode method will raise an exception. You can catch the exception and try again with a different encoding:

    for codec in (encoding, 'utf-8', 'latin-1'):
        try:
            key = key.decode(codec)
        except UnicodeDecodeError:
            pass
        else:
            break

Latin-1 should be last, because it has the nice property that it will *always* succeed. That doesn't mean it will give you the right characters, as intended by the person who wrote the website, just that it will always give you *some* characters. They may be completely wrong, in other
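The record-building and fallback-decoding ideas from this message can be combined into one self-contained Python 3 sketch. The names `decode_with_fallback` and `parse_record` are hypothetical, the sample bytes are assembled from the output quoted earlier in the thread, and the fallback order (declared encoding, then UTF-8, then Latin-1) follows strategy (2); this is an illustration under those assumptions, not the code that was eventually used.

```python
def decode_with_fallback(raw, declared=None):
    """Try the declared encoding, then UTF-8, then Latin-1 (which never fails)."""
    for codec in (declared, "utf-8", "latin-1"):
        if codec is None:
            continue  # no encoding was declared
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 decodes any byte string")


def parse_record(body, declared=None):
    """Build the record dict from 'Key: value' lines, strictly."""
    record = {"Country": None, "City": None, "Latitude": None,
              "Longitude": None, "IP": None}
    for line in body.splitlines():
        if not line.strip():
            continue  # skip the blank line in the response
        key, value = line.split(b":", 1)
        key = decode_with_fallback(key, declared).strip()
        value = decode_with_fallback(value, declared).strip()
        if key not in record:
            raise ValueError('unknown key "%s" found' % key)
        record[key] = value
    missing = [k for k, v in record.items() if v is None]
    if missing:
        raise ValueError("missing key in record: %s" % missing[0])
    return record


sample = (b"Country: COLOMBIA (CO)\nCity: Bogot\xe1\n\n"
          b"Latitude: 10.4\nLongitude: -75.2833\nIP: 201.234.178.62\n")
print(ascii(parse_record(sample)["City"]))    # 'Bogot\xe1'
```

With no declared encoding the city bytes fail as UTF-8 and are then decoded as Latin-1, which happens to be right here; as the message warns, that last step always succeeds but is not guaranteed to give the intended characters.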
Re: [Tutor] encoding question
On Sat, Jan 04, 2014 at 04:15:30PM -0800, Alex Kleider wrote: py 'Bogotá'.encode('utf-8') I'm interested in knowing how you were able to enter the above line (assuming you have a key board similar to mine.) I'm running Linux, and I use the KDE or Gnome character selector, depending on which computer I'm using. They give you a graphical window showing a screenful of characters at a time, depending on which application I'm using you can search for characters by name or property, then copy them into the clipboard to paste them into another application. I can also use the compose key. My keyboard doesn't have an actual key labelled compose, but my system is set to use the right-hand Windows key (between Alt and the menu key) as the compose key. (Why the left-hand Windows key isn't set to do the same thing is a mystery to me.) So if I type: Compose 'a I get á. The problem with the compose key is that it's not terribly intuitive. Sure, a few of them are: Compose 1 2 gives ½ but how do I get π (pi)? Compose p doesn't work. -- Steven ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
A heartfelt thank you to those of you that have given me much to ponder with your helpful responses. In the meantime I've rewritten my procedure using a different approach altogether. I'd be interested in knowing if you think it's worth keeping or do you suggest I use your revisions to my original hack?

I've been maintaining both a Python3 and a Python2.7 version. The latter has actually opened my eyes to more complexities. Specifically the need to use unicode strings rather than Python2.7's default ascii.

Here it is:

    alex@x301:~/Python/Parse$ cat ip_info.py
    #!/usr/bin/env python
    # -*- coding : utf -8 -*-

    import re
    import urllib2

    url_format_str = \
        u'http://api.hostip.info/get_html.php?ip=%s&position=true'

    info_exp = r"""
    Country:[ ](?P<country>.*)
    [\n]
    City:[ ](?P<city>.*)
    [\n]
    [\n]
    Latitude:[ ](?P<lat>.*)
    [\n]
    Longitude:[ ](?P<lon>.*)
    [\n]
    IP:[ ](?P<ip>.*)
    """
    info_pattern = re.compile(info_exp, re.VERBOSE).search

    def ip_info(ip_address):
        """
        Returns a dictionary keyed by Country, City, Lat, Long and IP.

        Depends on http://api.hostip.info (which returns the following:
        'Country: UNITED STATES (US)\nCity: Santa Rosa, CA\n\nLatitude:
        38.4486\nLongitude: -122.701\nIP: 76.191.204.54\n'.)
        THIS COULD BREAK IF THE WEB SITE GOES AWAY!!!
        """
        response = urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
        encoding = response.headers.getparam('charset')

        info = info_pattern(response.read().decode(encoding))
        return {"Country" : unicode(info.group("country")),
                "City" : unicode(info.group("city")),
                "Lat" : unicode(info.group("lat")),
                "Lon" : unicode(info.group("lon")),
                "IP" : unicode(info.group("ip"))}

    if __name__ == "__main__":
        print """    IP address is %(IP)s:
            Country: %(Country)s;  City: %(City)s.
            Lat/Long: %(Lat)s/%(Lon)s""" % ip_info("201.234.178.62")

Apart from soliciting your general comments, I'm also interested to know exactly what the line

    # -*- coding : utf -8 -*-

really indicates or more importantly, is it true, since I am using vim and I assume things are encoded as ascii?

I've discovered that with Ubuntu it's very easy to switch from English (US) to English (US, international with dead keys) with just two clicks so thanks for that tip as well.
Re: [Tutor] encoding question
Hi Alex,

According to:

    http://www.hostip.info/use.html

there is a JSON-based interface. I'd recommend using that one! JSON is a format that's easy for machines to decode. The format you're parsing is primarily for humans, and who knows if that will change in the future to make it easier to read?

Not only is JSON probably more reliable to parse, but the code itself should be fairly straightforward. For example:

    #########################################################
    ## In Python 2.7
    ##
    >>> import json
    >>> import urllib
    >>> response = urllib.urlopen('http://api.hostip.info/get_json.php')
    >>> info = json.load(response)
    >>> info
    {u'country_name': u'UNITED STATES', u'city': u'Mountain View, CA',
     u'country_code': u'US', u'ip': u'216.239.45.81'}
    #########################################################

Best of wishes!
Re: [Tutor] encoding question
You were asking earlier about the line: # -*- coding : utf -8 -*- See PEP 263: http://www.python.org/dev/peps/pep-0263/ http://docs.python.org/release/2.3/whatsnew/section-encodings.html It's a line that tells Python how to interpret the bytes of your source program. It allows us to write unicode literal strings embedded directly in the program source itself. ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] encoding question
On Sat, 04 Jan 2014 18:31:13 -0800, Alex Kleider aklei...@sonic.net wrote:
> exactly what the line
> # -*- coding : utf -8 -*-
> really indicates or more importantly, is it true, since I am using vim
> and I assume things are encoded as ascii?

I don't know vim specifically, but I'm 99% sure it will let you specify the encoding. Certainly emacs does, so I'd not expect vim to fall behind on such a fundamental point. It's also likely that it defaults to UTF-8 for new files. Either way, your job is to make sure that the encoding line matches what the editor is using.

Emacs also looks in the first few lines for that same encoding line, so if you format it carefully, it'll just work.

Easy to test for yourself, anyway. Just paste some international characters into a literal string.

-- DaveA
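Whether the line is "formatted carefully" can be checked against the pattern that the Python language reference documents for encoding declarations, coding[=:]\s*([-\w.]+). A minimal sketch, assuming the line in the script really is spaced the way it is quoted in this thread; declared_encoding is a made-up helper name, not part of any library:

```python
import re

# Pattern documented in the Python language reference for recognising
# an encoding declaration in a comment on line 1 or 2 of a source file.
CODING_RE = re.compile(r'coding[=:]\s*([-\w.]+)')


def declared_encoding(line):
    """Return the encoding named by a comment line, or None (hypothetical helper)."""
    if not line.lstrip().startswith('#'):
        return None
    match = CODING_RE.search(line)
    return match.group(1) if match else None


as_quoted = '# -*- coding : utf -8 -*-'    # spacing as quoted in the thread
conventional = '# -*- coding: utf-8 -*-'   # the usual Emacs-style spelling

print(declared_encoding(as_quoted))
print(declared_encoding(conventional))
```

If the spaces really are in the file, the pattern does not match, so Python would not treat the line as an encoding declaration and the default source encoding would apply (ASCII in Python 2, UTF-8 in Python 3).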
Re: [Tutor] encoding question
On 2014-01-04 18:44, Danny Yoo wrote:
> Hi Alex,
>
> According to:
>
>     http://www.hostip.info/use.html
>
> there is a JSON-based interface. I'd recommend using that one! JSON is a
> format that's easy for machines to decode. The format you're parsing is
> primarily for humans, and who knows if that will change in the future to
> make it easier to read?
>
> Not only is JSON probably more reliable to parse, but the code itself
> should be fairly straightforward. For example:
>
>     ## In Python 2.7
>     >>> import json
>     >>> import urllib
>     >>> response = urllib.urlopen('http://api.hostip.info/get_json.php')
>     >>> info = json.load(response)
>     >>> info
>     {u'country_name': u'UNITED STATES', u'city': u'Mountain View, CA',
>      u'country_code': u'US', u'ip': u'216.239.45.81'}

This strikes me as being the most elegant solution to date, and I thank you for it!

The problem is that the city name doesn't come in:

    alex@x301:~/Python/Parse$ cat tutor.py
    #!/usr/bin/env python
    # -*- coding : utf -8 -*-
    # file: 'tutor.py'
    """
    Put your docstring here.
    """
    print "Running 'tutor.py'..."

    import json
    import urllib
    response = urllib.urlopen\
     ('http://api.hostip.info/get_json.php?ip=201.234.178.62&position=true')
    info = json.load(response)
    print info

    alex@x301:~/Python/Parse$ ./tutor.py
    Running 'tutor.py'...
    {u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4',
     u'country_code': u'CO', u'country_name': u'COLOMBIA', u'lng': u'-75.2833'}

If I use my own IP the city comes in fine so there must still be some problem with the encoding. Should I be using

    encoding = response.headers.getparam('charset')

in there somewhere?

Any ideas?
Re: [Tutor] encoding question
On Sat, Jan 4, 2014 at 11:16 PM, Alex Kleider aklei...@sonic.net wrote:
> {u'city': None, u'ip': u'201.234.178.62', u'lat': u'10.4',
>  u'country_code': u'CO', u'country_name': u'COLOMBIA', u'lng': u'-75.2833'}
>
> If I use my own IP the city comes in fine so there must still be some
> problem with the encoding.

Report a bug in their JSON API. It's returning b'"city":null'. I see the same problem for www.msj.go.cr in San José, Costa Rica. It's probably broken for all non-ASCII byte strings.
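The None in the printed dictionary is simply what a JSON null decodes to, so nothing is going wrong on the Python side. A small sketch with a hand-built payload (reconstructed from the dictionary printed above, not captured from the service) shows the mapping, together with one possible way of handling the gap; the "unknown" placeholder is an arbitrary choice for illustration:

```python
import json

# Hand-built stand-in for the service's reply; the real bytes were not
# captured, this merely mirrors the dictionary printed in the thread.
payload = (b'{"country_name":"COLOMBIA","country_code":"CO",'
           b'"city":null,"ip":"201.234.178.62",'
           b'"lat":"10.4","lng":"-75.2833"}')

info = json.loads(payload.decode('utf-8'))

# JSON null becomes Python None: the key exists, the value is absent.
city = info['city']
display_city = city if city is not None else u'unknown'

print(repr(city))
print(display_city)
```

Checking "is None" rather than catching a later exception keeps the missing value explicit at the point where it enters the program.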
Re: [Tutor] encoding question
Oh! That's unfortunate! That looks like a bug on the hostip.info side. Check with them about it.

I can't get the source code to whatever is implementing the JSON response, so I can not say why the city is not being properly included there.

[... XML rant about to start. I am not disinterested, so my apologies in advance.]

... In that case... I suppose trying the XML output is a possible approach. But I truly dislike XML for being implemented in ways that are usually not fun to navigate: either the APIs or the encoded data are usually convoluted enough to make it a chore rather than a pleasure.

The beginning does look similar:

    ##########################################################
    >>> import xml.etree.ElementTree as ET
    >>> import urllib
    >>> response = urllib.urlopen("http://api.hostip.info?ip=201.234.178.62&position=true")
    >>> tree = ET.parse(response)
    >>> tree
    <xml.etree.ElementTree.ElementTree object at 0x185a2d0>
    ##########################################################

Up to this point, not so bad. But this is where it starts to look silly:

    ##########################################################
    >>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/ip').text
    '201.234.178.62'
    >>> tree.find('{http://www.opengis.net/gml}featureMember/Hostip/{http://www.opengis.net/gml}name').text
    u'Bogot\xe1'
    ##########################################################

where we need to deal with XML namespaces, an extra complexity for a benefit that I have never bought into.

More than that, usually the XML I run into in practice isn't even properly structured, as is the case with the lat-long value in the XML output here:

    ##########################################################
    >>> tree.find('.//{http://www.opengis.net/gml}coordinates').text
    '-75.2833,10.4'
    ##########################################################

which is truly silly. Why are the latitude and longitude not two separate, structured values? What is this XML buying us here, really then? I'm convinced that all the extraneous structure and complexity in XML causes the people who work with it to stop caring, the result being something that isn't for the benefit of either humans or computer programs.

Hence, that's why I prefer JSON: JSON export is usually a lot more sensible, for reasons that I can speculate on, but I probably should stop this rant. :P
Re: [Tutor] encoding question
> then? I'm convinced that all the extraneous structure and complexity
> in XML causes the people who work with it to stop caring, the result
> being something that isn't for the benefit of either humans or
> computer programs.

... I'm sorry. Sometimes I get grumpy when I haven't had a Snickers.

I should not have said the above here. It isn't factual, and worse, it insinuates an uncharitable intent to people who I do not know. There's enough insinuation and insults out there in the world already: I should not be contributing to those things. For that, I apologize.
Re: [Tutor] encoding question
On 2014-01-04 21:20, Danny Yoo wrote:
> [...]
> Hence, that's why I prefer JSON: JSON export is usually a lot more
> sensible, for reasons that I can speculate on, but I probably should
> stop this rant. :P

Not a rant at all. As it turns out, one of the other things that have interested me of late is docbook, an xml dialect (I think this is the correct way to express it.) I've found it very useful and so do not share your distaste for xml although one can't disagree with the points you've made with regard to xml as a solution to the problem under discussion. I've not played with the python xml interfaces before so this will be a good project for me.

Thanks.
Re: [Tutor] encoding question
On 2014-01-04 21:20, Danny Yoo wrote:
> Oh! That's unfortunate! That looks like a bug on the hostip.info side.
> Check with them about it.
>
> I can't get the source code to whatever is implementing the JSON
> response, so I can not say why the city is not being properly included
> there.
>
> [... XML rant about to start. I am not disinterested, so my apologies
> in advance.]
>
> ... In that case... I suppose trying the XML output is a possible
> approach.

Well, I've tried the xml approach which seems promising but still I get an encoding related error. Is there a bug in the xml.etree module (not very likely, methinks) or am I doing something wrong?

There's no denying that the whole encoding issue is still not completely clear to me in spite of having devoted a lot of time to trying to grasp all that's involved.

Here's what I've got:

    alex@x301:~/Python/Parse$ cat ip_xml.py
    #!/usr/bin/env python
    # -*- coding : utf -8 -*-
    # file: 'ip_xml.py'

    import urllib2
    import xml.etree.ElementTree as ET

    url_format_str = \
        u'http://api.hostip.info/?ip=%s&position=true'

    def ip_info(ip_address):
        response = urllib2.urlopen(url_format_str %\
                                   (ip_address, ))
        encoding = response.headers.getparam('charset')
        print "'encoding' is '%s'." % (encoding, )
        info = unicode(response.read().decode(encoding))
        n = info.find('\n')
        print "location of first newline is %s." % (n, )
        xml = info[n+1:]
        print "'xml' is '%s'." % (xml, )

        tree = ET.fromstring(xml)
        root = tree.getroot()   # Here's where it blows up!!!
        print "'root' is '%s', with the following children:" % (root, )
        for child in root:
            print child.tag, child.attrib
        print "END of CHILDREN"
        return info

    if __name__ == "__main__":
        info = ip_info("201.234.178.62")

    alex@x301:~/Python/Parse$ ./ip_xml.py
    'encoding' is 'iso-8859-1'.
    location of first newline is 44.
    'xml' is '<HostipLookupResultSet version="1.0.1"
    xmlns:gml="http://www.opengis.net/gml"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://www.hostip.info/api/hostip-1.0.1.xsd">
     <gml:description>This is the Hostip Lookup Service</gml:description>
     <gml:name>hostip</gml:name>
     <gml:boundedBy>
      <gml:Null>inapplicable</gml:Null>
     </gml:boundedBy>
     <gml:featureMember>
      <Hostip>
       <ip>201.234.178.62</ip>
       <gml:name>Bogotá</gml:name>
       <countryName>COLOMBIA</countryName>
       <countryAbbrev>CO</countryAbbrev>
       <!-- Co-ordinates are available as lng,lat -->
       <ipLocation>
        <gml:pointProperty>
         <gml:Point srsName="http://www.opengis.net/gml/srs/epsg.xml#4326">
          <gml:coordinates>-75.2833,10.4</gml:coordinates>
         </gml:Point>
        </gml:pointProperty>
       </ipLocation>
      </Hostip>
     </gml:featureMember>
    </HostipLookupResultSet>
    '.
    Traceback (most recent call last):
      File "./ip_xml.py", line 33, in <module>
        info = ip_info("201.234.178.62")
      File "./ip_xml.py", line 23, in ip_info
        tree = ET.fromstring(xml)
      File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1301, in XML
        parser.feed(text)
      File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1641, in feed
        self._parser.Parse(data, 0)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 456: ordinal not in range(128)
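A likely reading of the traceback: the error is raised inside ET.fromstring itself, because under Python 2.7 the parser is being handed an already-decoded unicode string containing a non-ASCII character. One way around it is to skip the manual decode and give the parser the undecoded bytes, letting it read the encoding from the XML declaration. The sketch below assumes the reply starts with such a declaration; the sample document is hand-built and much shorter than the real one.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for response.read(): undecoded bytes, encoded as
# the 'charset' header in the thread reports (iso-8859-1).
raw = (u'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
       u'<HostipLookupResultSet xmlns:gml="http://www.opengis.net/gml">'
       u'<gml:featureMember><Hostip>'
       u'<ip>201.234.178.62</ip>'
       u'<gml:name>Bogot\xe1</gml:name>'
       u'<gml:coordinates>-75.2833,10.4</gml:coordinates>'
       u'</Hostip></gml:featureMember>'
       u'</HostipLookupResultSet>').encode('iso-8859-1')

GML = '{http://www.opengis.net/gml}'

# Feed bytes, not text: the parser honours the declared encoding.
# Note that fromstring() returns the root Element directly, so there
# is no getroot() step (that method belongs to ElementTree objects).
root = ET.fromstring(raw)

hostip = root.find(GML + 'featureMember/Hostip')
ip = hostip.find('ip').text
city = hostip.find(GML + 'name').text

# The service's own XML comment says the order is lng,lat.
lng, lat = (float(part) for part in
            hostip.find(GML + 'coordinates').text.split(','))

print(ip, repr(city), lat, lng)
```

The design choice here is to decode exactly once, in the component that knows the document's declared encoding, rather than decoding by hand and then parsing.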
Re: [Tutor] Encoding error when reading text files in Python 3
Dat Huynh wrote:
> Dear all,
>
> I have written a simple application by Python to read data from text files.
>
> Current I have both Python version 2.7.2 and Python 3.2.3 on my laptop.
> I don't know why it does not run on Python version 3 while it runs
> well on Python 2.

Python 2 is more forgiving of beginner errors when dealing with text and bytes, but makes it harder to deal with text correctly.

Python 3 makes it easier to deal with text correctly, but is less forgiving.

When you read from a file in Python 2, it will give you *something*, even if it is the wrong thing. It will not give a decoding error, even if the text you are reading is not valid text. It will just give you junk bytes, sometimes known as mojibake.

Python 3 no longer does that. It tells you when there is a problem, so you can fix it.

> Could you please tell me how I can run it on python 3?
> Following is my Python code.
>
> --
> for subdir, dirs, files in os.walk(rootdir):
>     for file in files:
>         print("Processing [" + file + "]...\n")
>         f = open(rootdir+file, 'r')
>         data = f.read()
>         f.close()
>         print(data)
> --
>
> This is the error message:
[...]
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xd1 in position
> 4980: ordinal not in range(128)

This tells you that you are reading a non-ASCII file but haven't told Python what encoding to use, so Python fell back to its default, which on your system is ASCII.

Do you know what encoding the file is?

Do you understand about Unicode text and bytes? If not, I suggest you read this article:

http://www.joelonsoftware.com/articles/Unicode.html

In Python 3, you can either tell Python what encoding to use:

    f = open(rootdir+file, 'r', encoding='utf8')  # for example

or you can set an error handler:

    f = open(rootdir+file, 'r', errors='ignore')  # for example

or both:

    f = open(rootdir+file, 'r', encoding='ascii', errors='replace')

You can see the list of encodings and error handlers here:

http://docs.python.org/py3k/library/codecs.html

Unfortunately, Python 2 does not support this using the built-in open function. Instead, you have to use codecs.open instead of the built-in open, like this:

    import codecs
    f = codecs.open(rootdir+file, 'r', encoding='utf8')  # for example

which fortunately works in both Python 2 and 3.

Or you can read the file in binary mode, and then decode it into text:

    f = open(rootdir+file, 'rb')
    data = f.read()
    f.close()
    text = data.decode('cp866', 'replace')
    print(text)

If you don't know the encoding, you can try opening the file in Firefox or Internet Explorer and see if they can guess it, or you can use the chardet library in Python.

http://pypi.python.org/pypi/chardet

Or if you don't care about getting mojibake, you can pretend that the file is encoded using Latin-1. That will pretty much read anything, although what it gives you may be junk.

-- Steven
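To make the difference between the error handlers concrete, here is a small sketch that decodes the same bytes several ways. The sample bytes are invented and merely stand in for what f.read() would return from a file opened in 'rb' mode.

```python
# Invented sample: the text u'caf\xe9 \u044f' encoded as UTF-8.
raw = u'caf\xe9 \u044f'.encode('utf-8')

# Right codec: round-trips cleanly.
correct = raw.decode('utf-8')

# Wrong codec, default 'strict' handler: raises.
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Wrong codec, 'ignore': undecodable bytes are silently dropped.
ignored = raw.decode('ascii', 'ignore')

# Wrong codec, 'replace': each bad byte becomes U+FFFD.
replaced = raw.decode('ascii', 'replace')

# Latin-1 never fails: one character per byte, possibly mojibake.
as_latin1 = raw.decode('latin-1')

print(ascii_failed, repr(ignored), repr(replaced), repr(as_latin1))
```

Both 'ignore' and 'replace' lose information, which is why naming the real encoding is preferable whenever it can be found out.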
Re: [Tutor] Encoding error when reading text files in Python 3
I change my code and it runs on Python 3 now.

    f = open(rootdir+file, 'rb')
    data = f.read().decode('utf8', 'ignore')

Thank you very much.
Sincerely,
Dat.

On Sat, Jul 28, 2012 at 6:09 PM, Steven D'Aprano st...@pearwood.info wrote:
> [...]
Re: [Tutor] Encoding
Well, I am assuming that by this you mean converting user input into a string, and then stripping the numerals (0-9) out of it. Next time, please tell us your version of Python. I'll do my best to help with this. You might try the following:

    the_input = input("Insert string here: ")  # change to raw_input in python 2
    after = ""
    for char in the_input:
        try:
            char = int(char)
        except ValueError:
            # not a digit, so keep it
            after += char

If other symbols might be in the string ($, @, etc.), then you might use

    the_input = input('Insert string here: ')  # change to raw_input in python 2
    after = ''
    not_allowed = '1234567890-=!@#$%^**()_+,./?`~[]{}\\|'
    for char in the_input:
        if char not in not_allowed:
            after += char

This method requires more typing, but it works with a wider variety of characters. Hopefully this helped.

On Thu, Nov 17, 2011 at 8:45 PM, Nidian Job-Smith nidia...@hotmail.com wrote:
> Hi all,
>
> In my programme I am encoding what the user has in-putted.
>
> What the user inputs will in a string, which might a mixture of letters
> and numbers.
>
> However I only want the letters to be encoded.
>
> Does any-one how I can only allow the characters to be encoded ??
>
> Big thanks,
Re: [Tutor] Encoding
On 11/17/2011 8:45 PM, Nidian Job-Smith wrote:
> Hi all,
>
> In my programme I am encoding what the user has in-putted.
>
> What the user inputs will in a string, which might a mixture of letters
> and numbers.
>
> However I only want the letters to be encoded.

I am assuming that you meant only accept characters and not actual text encoding. The following example is untested and is limited. It will not really work with non-ASCII letters (i.e. Unicode).

    import string
    input_string = raw_input( 'Enter something' ) # use input in Python3
    final_input = [] # append to a list instead of concatenating a string
                     # because it is faster to ''.join( list )
    for char in input_string:
        if char in string.letters:  # string.ascii_letters in Python3
            final_input.append( char )
    input_string = ''.join( final_input )

Ramit

Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
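For the non-ASCII limitation mentioned above, one option is the string method isalpha(), which on Unicode text recognises letters beyond a-z. A minimal sketch, assuming "letters only" is what the original poster wanted; letters_only is a made-up name and the sample input is invented:

```python
def letters_only(text):
    """Keep only the characters that isalpha() classifies as letters."""
    # Build with join() rather than repeated string concatenation.
    return u''.join(ch for ch in text if ch.isalpha())


sample = u'abc123 d\xe9f!'   # invented input with digits, space, accent
print(repr(letters_only(sample)))
```

Note that spaces and punctuation are dropped along with the digits; if spaces should survive, the condition would need to allow them explicitly.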
Re: [Tutor] Encoding
On 11/17/2011 8:45 PM, Nidian Job-Smith wrote: Hi all, In my programme I am encoding what the user has in-putted. What the user inputs will in a string, which might a mixture of letters and numbers. However I only want the letters to be encoded. Does any-one how I can only allow the characters to be encoded ?? Your question makes no sense to me. Please explain what you mean by encoding letters? An example of input and output might also help. Be sure to reply-all. -- Bob Gailer 919-636-4239 Chapel Hill NC ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
2010/3/7 spir denis.s...@gmail.com
>> Oh, right. And, if i'm not wrong B is an UTF8 string decoded to unicode
>> (due to the coding: statement at the top of the file) and re-encoded
>> to latin1
>
> Si! :-)

Ahah. Ok, Grazie!

One more question: Amazon SimpleDB only accepts UTF8.

So, let's say i have to put into an image file:

    filestream = file.read()
    filetoput = filestream.encode('utf-8')

Do you think this is ok?

Oh, of course everything url-encoded then

Giorgio

-- AnotherNetFellow Email: anothernetfel...@gmail.com
Re: [Tutor] Encoding
On Sun, 7 Mar 2010 13:23:12 +0100 Giorgio anothernetfel...@gmail.com wrote:
> One more question: Amazon SimpleDB only accepts UTF8.
[...]
> filestream = file.read()
> filetoput = filestream.encode('utf-8')

No! What is the content of the file? Do you think it can be a pure python representation of a unicode text?

    uContent = inFile.read().decode(***format***)
    <process, if any>
    outFile.write(uContent.encode('utf-8'))

    input -->decode--> process -->encode--> output

This gives me an idea: when working with unicode, it would be cool to have an optional format parameter for file.read() and write. So, the above would be:

    uContent = inFile.read(***format***)
    <process, if any>
    outFile.write(uContent, 'utf-8')

Or, maybe even better, the format could be given as third parameter of file open(); then any read or write operation would directly convert from/to the said format. What do you all think?

denis
-- la vita e estrany
spir.wikidot.com
Re: [Tutor] Encoding
> Or, maybe even better, the format could be given as third parameter of
> file open(); then any read or write operation would directly convert
> from/to the said format. What do you all think?

See the codecs.open() command as an alternative to open().

With all the hassles of encoding, I'm puzzled why anyone would use the regular open() for anything but binary operations.

Malcolm

----- Original message -----
From: spir denis.s...@gmail.com
To: Python tutor tutor@python.org
Date: Sun, 7 Mar 2010 14:29:11 +0100
Subject: Re: [Tutor] Encoding

[...]
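The decode, process, encode pipeline sketched earlier in this thread looks like this with codecs.open() doing the conversions. This is only an illustration: the file names, the temporary directory and the sample text are all invented, and the "process" step is an arbitrary upper-casing.

```python
import codecs
import os
import tempfile

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'in_latin1.txt')    # invented file names
dst = os.path.join(tmpdir, 'out_utf8.txt')

# Set up a Latin-1 encoded input file to work on.
with open(src, 'wb') as f:
    f.write(u'ciao \xe8 ciao\n'.encode('latin-1'))

# input -->decode--> : codecs.open() decodes while reading.
with codecs.open(src, 'r', encoding='latin-1') as infile:
    text = infile.read()

# process: operate on text, not on bytes.
text = text.upper()

# -->encode--> output : codecs.open() encodes while writing.
with codecs.open(dst, 'w', encoding='utf-8') as outfile:
    outfile.write(text)

with open(dst, 'rb') as f:
    result = f.read()

print(repr(result))
```

In Python 3 the built-in open() accepts the same encoding argument, which is essentially the "parameter of file open()" proposed above.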
Re: [Tutor] Encoding
Giorgio wrote:
> 2010/3/7 spir denis.s...@gmail.com
>
> One more question: Amazon SimpleDB only accepts UTF8.
>
> So, let's say i have to put into an image file:

Do you mean a binary file with image data, such as a jpeg? In that case, an emphatic NO. Not even close.

> filestream = file.read()
> filetoput = filestream.encode('utf-8')
>
> Do you think this is ok?
>
> Oh, of course everything url-encoded then
>
> Giorgio

Encoding binary data with utf-8 wouldn't make any sense, even if you did have the right semantics for a text file.

Next problem, 'file' is a built-in name (the file type), not an open file. So if you write what you describe, you're calling an instance method on the class object itself, which will raise an error. Those two lines don't make any sense by themselves. Show us some context, and we can more sensibly comment on them. And try not to use names that hide built-ins or Python stdlib names.

If you're trying to store binary data in a repository that only permits text, it's not enough to pretend to convert it to UTF-8. You need to do some other escaping, such as UUENCODE, that transforms the binary data into something resembling text. Then you may or may not need to encode that text with utf-8, depending on its character set.

DaveA
Re: [Tutor] Encoding
2010/3/7 Dave Angel da...@ieee.org
> Those two lines don't make any sense by themselves. Show us some context,
> and we can more sensibly comment on them. And try not to use names that
> hide built-in keywords, or Python stdlib names.

Hi Dave,

I'm considering Amazon SimpleDB as an alternative to PGSQL, but i need to store blobs.

Amazon's FAQs says that:

"Q: What kind of data can I store?
You can store any UTF-8 string data in Amazon SimpleDB. Please refer to the Amazon Web Services Customer Agreement http://aws.amazon.com/agreement for details."

This is the problem. Any idea?

Giorgio

-- AnotherNetFellow Email: anothernetfel...@gmail.com
Re: [Tutor] Encoding
Giorgio wrote:
> 2010/3/7 Dave Angel da...@ieee.org
>
>> Those two lines don't make any sense by themselves. Show us some context, and we can more sensibly comment on them. And try not to use names that hide built-in keywords, or Python stdlib names.
>
> Hi Dave,
>
> I'm considering Amazon SimpleDB as an alternative to PGSQL, but i need to store blobs.
>
> Amazon's FAQs says that:
>
> Q: What kind of data can I store?
> You can store any UTF-8 string data in Amazon SimpleDB. Please refer to the Amazon Web Services Customer Agreement http://aws.amazon.com/agreement for details.
>
> This is the problem. Any idea?
>
>> DaveA
>
> Giorgio

You still didn't provide the full context. Are you trying to store binary data, or not? Assuming you are, you could use the UUENCODE suggestion I made. Or use base64:

base64.encodestring(s) will turn binary data into (larger) binary data, also considered a string. The latter is ASCII, so it's irrelevant whether it's considered utf-8 or otherwise. You store the resulting string in your database, and use base64.decodestring(s) to reconstruct your original.

There are 50 other ways, some more efficient, but this may be the simplest.

DaveA
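The base64 round trip described above can be sketched as follows. This is only an illustration written for Python 3, where base64.encodestring() and decodestring() are no longer available and b64encode()/b64decode() do the same job. The helper names to_text_blob and from_text_blob are hypothetical, and nothing here talks to SimpleDB itself.

```python
import base64


def to_text_blob(data: bytes) -> str:
    """Turn arbitrary binary data into plain ASCII text (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


def from_text_blob(text: str) -> bytes:
    """Reverse to_text_blob(): recover the original bytes."""
    return base64.b64decode(text.encode("ascii"))


# A few arbitrary non-text bytes standing in for image data.
raw = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

stored = to_text_blob(raw)         # safe to put in a text-only store
restored = from_text_blob(stored)  # identical to the original bytes
```

Because the stored value is pure ASCII, it is valid UTF-8 as well, which is why the "UTF-8 only" restriction stops mattering once the data has been escaped this way. The price is size: base64 output is roughly a third larger than the input.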
Re: [Tutor] Encoding
Giorgio, 05.03.2010 14:56:
> What i don't understand is why:
>
> s = u"ciao è ciao" is converting a string to unicode, decoding it from the specified encoding
>
> but
>
> t = "ciao è ciao"
> t = unicode(t)
>
> That should do exactly the same instead of using the specified encoding always assume that if i'm not telling the function what the encoding is, i'm using ASCII.
>
> Is this a bug?

Did you read the Unicode tutorial at the link I posted? Here's the link again:

http://www.amk.ca/python/howto/unicode

Stefan
Re: [Tutor] Encoding
2010/3/5 Dave Angel da...@ieee.org I'm not angry, and I'm sorry if I seemed angry. Tone of voice is hard to convey in a text message. Ok, sorry. I've misunderstood your mail :D I'm still not sure whether your confusion is to what the rules are, or why the rules were made that way. WHY the rules are made that way. But now it's clear. 2010/3/6 Mark Tolonen metolone+gm...@gmail.com metolone%2bgm...@gmail.com Maybe this will help: # coding: utf-8 a = ciao è ciao b = uciao è ciao.encode('latin-1') a is a UTF-8 string, due to #coding line in source. b is a latin-1 string, due to explicit encoding. Oh, right. And, if i'm not wrong B is an UTF8 string decoded to unicode (due to the coding: statement at the top of the file) and re-encoded to latin1 -Mark Thankyou again Giorgio -- -- AnotherNetFellow Email: anothernetfel...@gmail.com ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
Ok,so you confirm that: s = uciao è ciao will use the file specified encoding, and that t = ciao è ciao t = unicode(t) Will use, if not specified in the function, ASCII. It will ignore the encoding I specified on the top of the file. right? A literal u string, and only such a (unicode) literal string, is affected by the encoding specification. Once some bytes have been stored in a 8 bit string, the system does *not* keep track of where they came from, and any conversions then (even if they're on an adjacent line) will use the default decoder. This is a logical example of what somebody said earlier on the thread -- decode any data to unicode as early as possible, and deal only with unicode strings in the program. Then, if necessary, encode them into whatever output form immediately before (or while) outputting them. Ok Dave, What i don't understand is why: s = uciao è ciao is converting a string to unicode, decoding it from the specified encoding but t = ciao è ciao t = unicode(t) That should do exactly the same instead of using the specified encoding always assume that if i'm not telling the function what the encoding is, i'm using ASCII. Is this a bug? Giorgio -- -- AnotherNetFellow Email: anothernetfel...@gmail.com ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
Giorgio wrote: Ok,so you confirm that: s = uciao è ciao will use the file specified encoding, and that t = ciao è ciao t = unicode(t) Will use, if not specified in the function, ASCII. It will ignore the encoding I specified on the top of the file. right? A literal u string, and only such a (unicode) literal string, is affected by the encoding specification. Once some bytes have been stored in a 8 bit string, the system does *not* keep track of where they came from, and any conversions then (even if they're on an adjacent line) will use the default decoder. This is a logical example of what somebody said earlier on the thread -- decode any data to unicode as early as possible, and deal only with unicode strings in the program. Then, if necessary, encode them into whatever output form immediately before (or while) outputting them. Ok Dave, What i don't understand is why: s = uciao è ciao is converting a string to unicode, decoding it from the specified encoding but t = ciao è ciao t = unicode(t) That should do exactly the same instead of using the specified encoding always assume that if i'm not telling the function what the encoding is, i'm using ASCII. Is this a bug? Giorgio In other words, you don't understand my paragraph above. Once the string is stored in t as an 8 bit string, it's irrelevant what the source file encoding was. If you then (whether it's in the next line, or ten thousand calls later) try to convert to unicode without specifying a decoder, it uses the default encoder, which is a application wide thing, and not a source file thing. To see what it is on your system, use sys.getdefaultencoding(). There's an encoding specified or implied for each source file of an application, and they need not be the same. It affects string literals that come from that particular file. It does not affect any other conversions, as far as I know. For that matter, many of those source files may not even exist any more by the time the application is run. 
There are also encodings attached to each file object, I believe, though I've got no experience with that. So sys.stdout would have an encoding defined, and any unicode strings passed to it would be converted using that specification. The point is that there isn't just one global value, and it's a good thing. You should figure everywhere characters come into your program (eg. source files, raw_input, file i/o...) and everywhere characters go out of your program, and deal with each of them individually. Don't store anything internally as strings, and you won't create the ambiguity you have with your 't' variable above. DaveA ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
2010/3/5 Dave Angel da...@ieee.org In other words, you don't understand my paragraph above. Maybe. But please don't be angry. I'm here to learn, and as i've run into a very difficult concept I want to fully undestand it. Once the string is stored in t as an 8 bit string, it's irrelevant what the source file encoding was. Ok, you've said this 2 times, but, please, can you tell me why? I think that's the key passage to understand how encoding of strings works. The source file encoding affects all file lines, also strings. If my encoding is UTF8 python will read the string ciao è ciao as 'ciao \xc3\xa8 ciao' but if it's latin1 it will read 'ciao \xe8 ciao'. So, how can it be irrelevant? I think the problem is that i can't find any difference between 2 lines quoted above: a = uciao è ciao and a = ciao è ciao a = unicode(a) If you then (whether it's in the next line, or ten thousand calls later) try to convert to unicode without specifying a decoder, it uses the default encoder, which is a application wide thing, and not a source file thing. To see what it is on your system, use sys.getdefaultencoding(). And this is ok. Spir said that it uses ASCII, you now say that it uses the default encoder. I think that ASCII on spir's system is the default encoder so. The point is that there isn't just one global value, and it's a good thing. You should figure everywhere characters come into your program (eg. source files, raw_input, file i/o...) and everywhere characters go out of your program, and deal with each of them individually. Ok. But it always happen this way. I hardly ever have to work with strings defined in the file. Don't store anything internally as strings, and you won't create the ambiguity you have with your 't' variable above. DaveA Thankyou Dave Giorgio -- -- AnotherNetFellow Email: anothernetfel...@gmail.com ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
Giorgio wrote: 2010/3/5 Dave Angel da...@ieee.org In other words, you don't understand my paragraph above. Maybe. But please don't be angry. I'm here to learn, and as i've run into a very difficult concept I want to fully undestand it. I'm not angry, and I'm sorry if I seemed angry. Tone of voice is hard to convey in a text message. Once the string is stored in t as an 8 bit string, it's irrelevant what the source file encoding was. Ok, you've said this 2 times, but, please, can you tell me why? I think that's the key passage to understand how encoding of strings works. The source file encoding affects all file lines, also strings. Nope, not strings. It only affects string literals. If my encoding is UTF8 python will read the string ciao è ciao as 'ciao \xc3\xa8 ciao' but if it's latin1 it will read 'ciao \xe8 ciao'. So, how can it be irrelevant? I think the problem is that i can't find any difference between 2 lines quoted above: s = uciao è ciao and t = ciao è ciao c = unicode(t) [** I took the liberty of making the variable names different so I can refer to them **] I'm still not sure whether your confusion is to what the rules are, or why the rules were made that way. The rules are that an unqualified conversion, such as the unicode() function with no second argument, uses the default encoding, in strict mode. Thus the error. Quoting the help: If no optional parameters are given, unicode() will mimic the behaviour of str() except that it returns Unicode strings instead of 8-bit strings. More precisely, if /object/ is a Unicode string or subclass it will return that Unicode string without any additional decoding applied. For objects which provide a __unicode__() ../reference/datamodel.html#object.__unicode__ method, it will call this method without arguments to create a Unicode string. For all other objects, the 8-bit string version or representation is requested and then converted to a Unicode string using the codec for the default encoding in 'strict' mode. 
As for why the rules are that, I'd have to ask you what you'd prefer. The unicode() function has no idea that t was created from a literal (and no idea what source file that literal was in), so it has to pick some encoding, called the default encoding. The designers decided to use a default encoding of ASCII, because manipulating ASCII strings is always safe, while many functions won't behave as expected when given UTF-8 encoded strings. For example, what's the 7th character of t? That is not necessarily the same as the 7th character of s, since one or more of the characters before it might take up multiple bytes in t. In a latin-1 source file that doesn't happen for your accented character, but in a UTF-8 source file it does (è is two bytes in UTF-8), and it certainly does for other languages as well.

>> If you then (whether it's in the next line, or ten thousand calls later) try to convert to unicode without specifying a decoder, it uses the default encoder, which is a application wide thing, and not a source file thing. To see what it is on your system, use sys.getdefaultencoding().
>
> And this is ok. Spir said that it uses ASCII, you now say that it uses the default encoder. I think that ASCII on spir's system is the default encoder so.

I don't know, but I think it's the default in every country, at least on version 2.6. It might make sense to get some value from the OS that defined the locally preferred encoding, but then a program that worked fine in one locale might fail miserably in another.

>> The point is that there isn't just one global value, and it's a good thing. You should figure everywhere characters come into your program (eg. source files, raw_input, file i/o...) and everywhere characters go out of your program, and deal with each of them individually.
>
> Ok. But it always happen this way. I hardly ever have to work with strings defined in the file.

Not sure what you mean by the file. If you mean the source file, that's what your examples are about.
If you mean a data file, that's dealt with differently. Don't store anything internally as strings, and you won't create the ambiguity you have with your 't' variable above. DaveA Thankyou Dave Giorgio ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
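To make the point about byte strings concrete, here is a small sketch. It is written for Python 3, which is an assumption on the editor's side: the thread discusses Python 2, whose 8-bit str corresponds to Python 3 bytes and whose unicode corresponds to Python 3 str. It shows that the same text becomes different bytes under different encodings, that a byte string carries no record of which encoding produced it, and that the "7th character" differs between text and bytes.

```python
text = "ciao è ciao"  # text (Python 2: a unicode object)

as_utf8 = text.encode("utf-8")      # 12 bytes: è becomes two bytes
as_latin1 = text.encode("latin-1")  # 11 bytes: è becomes one byte

# Indexing text gives characters; indexing bytes gives raw bytes.
seventh_char = text[6]            # the space after è
seventh_byte = as_utf8[6:7]       # second half of the two-byte è

# Decoding without knowing the encoding means guessing. A strict ASCII
# decode (Python 2's default for unicode(t)) refuses rather than guesses.
ascii_error = None
try:
    as_utf8.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error = exc
```

Nothing in as_utf8 or as_latin1 says which codec produced it, so the only reliable approach is the one given above: decode at the point where the bytes enter the program, while the encoding is still known.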
Re: [Tutor] Encoding
Giorgio anothernetfel...@gmail.com wrote in message news:23ce85921003050915p1a084c0co73d973282d8fb...@mail.gmail.com...
> 2010/3/5 Dave Angel da...@ieee.org
>
> I think the problem is that i can't find any difference between 2 lines quoted above:
>
> a = u"ciao è ciao"
>
> and
>
> a = "ciao è ciao"
> a = unicode(a)

Maybe this will help:

# coding: utf-8

a = "ciao è ciao"
b = u"ciao è ciao".encode('latin-1')

a is a UTF-8 byte string, because the source file is saved as UTF-8 (which is what the #coding line declares).
b is a latin-1 byte string, due to explicit encoding.

a = unicode(a)
b = unicode(b)

Now what will happen? unicode() uses 'ascii' if not specified, because it has no idea of the encoding of a or b. Only the programmer knows. It does not use the #coding line to decide.

#coding is *only* used to specify the encoding the source file is saved in, so when Python executes the script, reads the source and parses a literal Unicode string (u'...', u"...", etc.) the bytes read from the file are decoded using the #coding specified.

If Python parses a byte string ('...', "...", etc.) the bytes read from the file are stored directly in the string. The coding line isn't even used. The bytes will be exactly what was saved in the file between the quotes.

-Mark
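The difference between a and b can be checked directly. The sketch below is a Python 3 rendering (an assumption, since the thread uses Python 2): the two byte strings are built with explicit encode() calls instead of relying on how the source file was saved. It shows that each one decodes correctly only with its own codec, and what happens when the wrong one is used.

```python
text = "ciao è ciao"

a = text.encode("utf-8")    # what a UTF-8 source file would store
b = text.encode("latin-1")  # the explicitly latin-1 encoded version

# Only the programmer knows which codec belongs to which byte string.
from_a = a.decode("utf-8")
from_b = b.decode("latin-1")

# The wrong codec does not always fail. Here it silently produces
# two wrong characters in place of è.
garbled = a.decode("latin-1")
```

Both from_a and from_b equal the original text, while garbled does not. A silent wrong answer like this is a good argument for never letting a default pick the codec.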
Re: [Tutor] Encoding
Hi,

For everybody who's having trouble understanding encoding, I found this page useful:
http://evanjones.ca/python-utf8.html

Cheers!!
Albert-Jan

~~ In the face of ambiguity, refuse the temptation to guess. ~~

--- On Thu, 3/4/10, spir denis.s...@gmail.com wrote:

From: spir denis.s...@gmail.com
Subject: Re: [Tutor] Encoding
To: tutor@python.org
Date: Thursday, March 4, 2010, 8:01 AM

On Wed, 3 Mar 2010 20:44:51 +0100 Giorgio anothernetfel...@gmail.com wrote:

> Please let me post the third update O_o. You can forgot other 2, i'll put them into this email.
>
> ---
> >>> s = "ciao è ciao"
> >>> print s
> ciao è ciao
> >>> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<pyshell#2>", line 1, in <module>
>     s.encode('utf-8')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 5: ordinal not in range(128)
> ---
>
> I am getting more and more confused.

What you enter on the terminal prompt is text, encoded in a format (ascii, latin*, utf*,...) that probably depends on your system locale. As this format is always a sequence of bytes, python stores it as a plain str:

>>> s = "ciao è ciao"
>>> s,type(s)
('ciao \xc3\xa8 ciao', <type 'str'>)

My system is configured for utf8. c3-a8 is the repr of 'è' in utf8. It needs 2 bytes because of the rules of utf8 itself. Right?

To get a python unicode string, it must be decoded from its format, for me utf8:

>>> u = s.decode("utf8")
>>> u,type(u)
(u'ciao \xe8 ciao', <type 'unicode'>)

e8 is the unicode code for 'è' (decimal 232). You can check that in tables. It needs here one byte only because 232 < 255.

> [comparison with php]
>
> Ok, now, the point is: you (and the manual) said that this line:
>
> s = u"giorgio è giorgio"
>
> will convert the string as unicode.

Yes and no: it will convert it *into* a unicode string, in the sense of a python representation for universal text. When seeing u"...", python will automagically *decode* the part in "...", taking as source format the one you indicate in a pseudo-comment on top of your code file, eg:

# coding: utf8

Else I guess the default is the system's locale format? Or ascii? Someone knows?

So, in my case u"giorgio è giorgio" is equivalent to "giorgio è giorgio".decode("utf8"):

>>> u1 = u"giorgio è giorgio"
>>> u2 = "giorgio è giorgio".decode("utf8")
>>> u1,u2
(u'giorgio \xe8 giorgio', u'giorgio \xe8 giorgio')
>>> u1 == u2
True

> But also said that the part between "" will be encoded with my editor BEFORE getting encoded in unicode by python.

"will be encoded with my editor BEFORE getting encoded in unicode by python"
-->
"will be encoded *by* my editor BEFORE getting *decoded* *into* unicode by python"

> So please pay attention to this example:
>
> My editor is working in UTF8. I create this:
>
> c = "giorgio è giorgio" // This will be an UTF8 string because of the file's encoding

Right.

> d = unicode(c) // This will be an unicode string
> e = c.encode() // How will be encoded this string? If PY is working like PHP this will be an utf8 string.

Have you tried it?

>>> c = "giorgio è giorgio"
>>> d = unicode(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

Now, tell us why! (the answer is below *)

> Can you help me?
>
> Thankyou VERY much
>
> Giorgio

Denis

(*) You don't tell which format the source string is encoded in. By default, python uses ascii (I know, it's stupid), whose max code is 127. So, 'è' is not accepted. Now, if I give a format, all works fine:

>>> d = unicode(c,"utf8")
>>> d
u'giorgio \xe8 giorgio'

Note: unicode(c,format) is an alias for c.decode(format):

>>> c.decode("utf8")
u'giorgio \xe8 giorgio'

la vita e estrany
spir.wikidot.com
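The equivalence noted above, unicode(c, format) versus c.decode(format), has a direct counterpart in Python 3, where the constructor is str(bytes_value, encoding). The following sketch assumes Python 3 and builds the byte string explicitly instead of taking it from a terminal:

```python
# The UTF-8 bytes a terminal configured for utf8 would hand to Python.
c = "giorgio è giorgio".encode("utf-8")

d1 = str(c, "utf8")     # Python 3 spelling of unicode(c, "utf8")
d2 = c.decode("utf8")   # the method form; same result

code_point = ord("è")   # 232, i.e. 0xe8, as listed in the Unicode tables
utf8_pair = c[8:10]     # the two bytes UTF-8 uses for è
```

The index 8 matches the "position 8" reported in the traceback above: "giorgio " is eight characters long, so the first byte of è sits at offset 8.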
Re: [Tutor] Encoding
Thankyou. You have clarificated many things in those emails. Due to high numbers of messages i won't quote everything. So, as i can clearly understand reading last spir's post, python gets strings encoded by my editor and to convert them to unicode i need to specify HOW they're encoded. This makes clear this example: c = giorgio è giorgio d = c.decode(utf8) I create an utf8 string, and to convert it into unicode i need to tell python that the string IS utf8. Just don't understand why in my Windows XP computer in Python IDLE doesn't work: RESTART c = giorgio è giorgio c 'giorgio \xe8 giorgio' d = c.decode(utf8) Traceback (most recent call last): File pyshell#10, line 1, in module d = c.decode(utf8) File C:\Python26\lib\encodings\utf_8.py, line 16, in decode return codecs.utf_8_decode(input, errors, True) UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: invalid data In IDLE options i've set encoding to UTF8 of course. I also have some linux servers where i can try the IDLE but Putty doesn't seem to support UTF8. But, let's continue: In that example i've specified UTF8 in the decode method. If i hadn't set it python would have taken the one i specified in the second line of the file, right? As last point, i can't understand why this works: a = ugiorgio è giorgio a u'giorgio \xe8 giorgio' And this one doesn't: a = giorgio è giorgio b = unicode(a) Traceback (most recent call last): File pyshell#14, line 1, in module b = unicode(a) UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 8: ordinal not in range(128) The second doesn't work because i have not told python how the string was encoded. But in the first too i haven't specified the encoding O_O. Thankyou again for your help. Giorgio ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
On Thu, 4 Mar 2010 15:13:44 +0100 Giorgio anothernetfel...@gmail.com wrote:

> Thankyou. You have clarificated many things in those emails. Due to high numbers of messages i won't quote everything.
>
> So, as i can clearly understand reading last spir's post, python gets strings encoded by my editor and to convert them to unicode i need to specify HOW they're encoded. This makes clear this example:
>
> c = "giorgio è giorgio"
> d = c.decode("utf8")
>
> I create an utf8 string, and to convert it into unicode i need to tell python that the string IS utf8.
>
> Just don't understand why in my Windows XP computer in Python IDLE doesn't work:
>
> RESTART
> >>> c = "giorgio è giorgio"
> >>> c
> 'giorgio \xe8 giorgio'
> >>> d = c.decode("utf8")
> Traceback (most recent call last):
>   File "<pyshell#10>", line 1, in <module>
>     d = c.decode("utf8")
>   File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10: invalid data

How do you know your win XP terminal is configured to deal with text using utf8? Why do you think it should?
Don't know much about windows, but I've read they have their own character sets (and format?). So, probably, if you haven't personalized it, it won't. (Conversely, I guess Macs use utf8 as default. Someone confirms?)
In other words, c is not a piece of text in utf8.

> In IDLE options i've set encoding to UTF8 of course. I also have some linux servers where i can try the IDLE but Putty doesn't seem to support UTF8.
>
> But, let's continue:
>
> In that example i've specified UTF8 in the decode method. If i hadn't set it python would have taken the one i specified in the second line of the file, right?
>
> As last point, i can't understand why this works:
>
> >>> a = u"giorgio è giorgio"
> >>> a
> u'giorgio \xe8 giorgio'

This trial uses the default format of your system. It does the same as
a = "giorgio è giorgio".decode(default_format)
It's a shortcut for ustring *literals* (constants), directly expressed by the programmer. In source code, it would use the format specified on top of the file.

> And this one doesn't:
>
> >>> a = "giorgio è giorgio"
> >>> b = unicode(a)
> Traceback (most recent call last):
>   File "<pyshell#14>", line 1, in <module>
>     b = unicode(a)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 8: ordinal not in range(128)

This trial uses ascii because you give no format (yes, it can be seen as a flaw). It does the same as
b = "giorgio è giorgio".decode("ascii")

> The second doesn't work because i have not told python how the string was encoded. But in the first too i haven't specified the encoding O_O.
>
> Thankyou again for your help.
>
> Giorgio

Denis
--
la vita e estrany
spir.wikidot.com
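The failure on Windows XP can be reproduced from the bytes shown in the session ('giorgio \xe8 giorgio'). The sketch below uses Python 3 syntax. It assumes, and this is only an assumption because the thread never states the console code page, that the shell delivered the text in a Western European single-byte encoding such as cp1252, where è is the single byte 0xe8.

```python
# The bytes reported by the IDLE session in the message above.
typed = b"giorgio \xe8 giorgio"

# 0xe8 followed by a space is not a valid UTF-8 sequence, so this fails.
try:
    typed.decode("utf8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumed encoding: cp1252 (hypothetical, depends on the actual system).
decoded = typed.decode("cp1252")
```

With the codec that actually produced the bytes, the decode succeeds. The general lesson is the one spir gives: the byte string was never utf8 in the first place, whatever the editor option says.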
Re: [Tutor] Encoding
2010/3/4 spir denis.s...@gmail.com How do you know your win XP terminal is configured to deal with text using utf8? Why do you think it should? I think there is an option in IDLE configuration to set this. So, if my entire system is not utf8 i can't use the IDLE for this test? This trial uses the default format of your system. It does the same as a = giorgio è giorgio.encode(default_format) It's a shorcut for ustring *literals* (constants), directly expressed by the programmer. In source code, it would use the format specified on top of the file. This trial uses ascii because you give no format (yes, it can be seen as a flaw). It does the same as a = giorgio è giorgio.encode(ascii) Ok,so you confirm that: s = uciao è ciao will use the file specified encoding, and that t = ciao è ciao t = unicode(t) Will use, if not specified in the function, ASCII. It will ignore the encoding I specified on the top of the file. right? Again, thankyou. I'm loving python and his community. Giorgio -- -- AnotherNetFellow Email: anothernetfel...@gmail.com ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
Giorgio wrote: 2010/3/4 spir denis.s...@gmail.com snip Ok,so you confirm that: s = uciao è ciao will use the file specified encoding, and that t = ciao è ciao t = unicode(t) Will use, if not specified in the function, ASCII. It will ignore the encoding I specified on the top of the file. right? A literal u string, and only such a (unicode) literal string, is affected by the encoding specification. Once some bytes have been stored in a 8 bit string, the system does *not* keep track of where they came from, and any conversions then (even if they're on an adjacent line) will use the default decoder. This is a logical example of what somebody said earlier on the thread -- decode any data to unicode as early as possible, and deal only with unicode strings in the program. Then, if necessary, encode them into whatever output form immediately before (or while) outputting them. Again, thankyou. I'm loving python and his community. Giorgio ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
Giorgio, 03.03.2010 09:36:
> i am looking for more informations about encoding in python:
>
> i've read that Amazon SimpleDB accepts every string encoded in UTF-8. How can I encode a string?

byte_string = unicode_string.encode('utf-8')

If you use unicode strings throughout your application, you will be happy with the above. Note that this is advice, not a condition.

> And, what's the default string encoding in python?

default encodings are bad, don't rely on them.

Stefan
Re: [Tutor] Encoding
Giorgio wrote:
> i am looking for more informations about encoding in python:
>
> i've read that Amazon SimpleDB accepts every string encoded in UTF-8. How can I encode a string? And, what's the default string encoding in python?

I think the safest way is to use unicode strings in your application and convert them to byte strings if needed, using the encode and decode methods.

> the other question is about mysql DB: if i have a mysql field latin1 and extract his content in a python script, how can I handle it?

If you have a byte string s encoded in 'latin1' you can simply call:

s.decode('latin1')

to get the unicode string.

> thankyou
>
> Giorgio

Patrick
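Putting the two answers together (decode what comes out of the latin1 field, encode to UTF-8 for SimpleDB) gives a short transcoding step. This is a minimal Python 3 sketch. The function name latin1_to_utf8 and the sample value are hypothetical, and no database is involved.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode latin1 bytes as UTF-8 by going through text (hypothetical helper)."""
    text = raw.decode("latin1")   # bytes -> text, using the known source encoding
    return text.encode("utf-8")   # text -> bytes, in the encoding the target wants


# Stand-in for a value read from a latin1 column.
row_value = b"perch\xe9"
for_simpledb = latin1_to_utf8(row_value)
```

The intermediate text value is where any real processing should happen. The bytes on either side exist only at the boundaries of the program.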
Re: [Tutor] Encoding
byte_string = unicode_string.encode('utf-8') If you use unicode strings throughout your application, you will be happy with the above. Note that this is an advice, not a condition. Mmm ok. So all strings in the app are unicode by default? Do you know if there is a function/method i can use to check encoding of a string? default encodings are bad, don't rely on them. No, ok, it was just to understand what i'm working with. Patrick, ok. I should check if it's possible to save unicode strings in the DB. Do you think i'd better set my db to utf8? I don't need latin1, it's just the default value. Thankyou Giorgio -- -- AnotherNetFellow Email: anothernetfel...@gmail.com ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
Oh, sorry, let me update my last post: if i have a string, let's say: s = hi giorgio; and want to store it in a latin1 db, i need to convert it to latin1 before storing, right? 2010/3/3 Giorgio anothernetfel...@gmail.com byte_string = unicode_string.encode('utf-8') If you use unicode strings throughout your application, you will be happy with the above. Note that this is an advice, not a condition. Mmm ok. So all strings in the app are unicode by default? Do you know if there is a function/method i can use to check encoding of a string? default encodings are bad, don't rely on them. No, ok, it was just to understand what i'm working with. Patrick, ok. I should check if it's possible to save unicode strings in the DB. Do you think i'd better set my db to utf8? I don't need latin1, it's just the default value. Thankyou Giorgio -- -- AnotherNetFellow Email: anothernetfel...@gmail.com -- -- AnotherNetFellow Email: anothernetfel...@gmail.com ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
> Mmm ok. So all strings in the app are unicode by default?

Depends on your python version. If you use python 2.x, you have to use a u before the string:

s = u'Hallo World'

> Do you know if there is a function/method i can use to check encoding of a string?

AFAIK such a function doesn't exist. Python3 solves this by using unicode strings by default.

> Patrick, ok. I should check if it's possible to save unicode strings in the DB.

It is more an issue of your database adapter than of your database.

> Do you think i'd better set my db to utf8? I don't need latin1, it's just the default value.

I think the encoding of the db doesn't matter much in this case, but I would prefer utf-8 over latin-1. If you get a utf-8 encoded raw byte string you call .decode('utf-8'). In case of a latin-1 encoded string you call .decode('latin1').

> Thankyou
>
> Giorgio

- Patrick
Re: [Tutor] Encoding
Giorgio, 03.03.2010 15:50:
>> Depends on your python version. If you use python 2.x, you have to use a u before the string:
>>
>> s = u'Hallo World'
>
> Ok. So, let's go back to my first question:
>
> s = u'Hallo World' is unicode in python 2.x - ok

Correct.

> s = 'Hallo World' how is encoded?

Depends on your source code encoding.

http://www.python.org/dev/peps/pep-0263/

> Well, the problem comes, i.e when i'm getting a string from an HTML form with POST. I don't and can't know the encoding, right? It depends on browser.

The browser will tell you the encoding in the headers that it transmits.

Stefan
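When a request does carry a charset in its Content-Type header, it can be used to decode the form fields. The sketch below is a Python 3 illustration using only the standard library. The function name decode_form, the sample header values and the UTF-8 fallback are all assumptions made for the example, and a real web framework would normally do this for you. Not every browser sends the charset, which is why a fallback is needed at all.

```python
from email.message import Message
from urllib.parse import parse_qs


def decode_form(body: str, content_type: str, fallback: str = "utf-8") -> dict:
    """Decode a urlencoded form body using the charset named in Content-Type.

    Hypothetical helper: falls back to `fallback` when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or fallback
    return parse_qs(body, encoding=charset)


# A latin-1 submission: è/é are sent as single percent-escaped bytes.
fields = decode_form(
    "name=perch%E9",
    "application/x-www-form-urlencoded; charset=ISO-8859-1",
)

# No charset in the header: the assumed UTF-8 fallback is used.
fields_default = decode_form(
    "name=perch%C3%A9",
    "application/x-www-form-urlencoded",
)
```

Both calls produce the same text value, because each body was decoded with the encoding it was actually sent in.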
Re: [Tutor] Encoding
Giorgio wrote:
>> Depends on your python version. If you use python 2.x, you have to use a u before the string:
>>
>> s = u'Hallo World'
>
> Ok. So, let's go back to my first question:
>
> s = u'Hallo World' is unicode in python 2.x - ok
> s = 'Hallo World' how is encoded?

I am not 100% sure, but I think it depends on the encoding of your source file or the coding you specify. See PEP 263 http://www.python.org/dev/peps/pep-0263/

> Well, the problem comes, i.e when i'm getting a string from an HTML form with POST. I don't and can't know the encoding, right? It depends on browser.

Right, but you can do something about it. Tell the browser which encoding you are going to accept:

<form ... accept-charset="UTF-8"> ... </form>

- Patrick
Re: [Tutor] Encoding
Uff, encoding is a very painful thing in programming. Ok so now comes last layer of the encoding: the webserver. I now know how to handle encoding in a python app and in interactions with the db, but the last step is sending the content to the webserver. How should i encode pages? The encoding i choose has to be the same than the one i choose in the .htaccess file? Or maybe i can send content encoded how i like more to apache and it re-encodes in the right way all pages? Thankyou 2010/3/3 Patrick Sabin patrick.just4...@gmail.com Giorgio wrote: Depends on your python version. If you use python 2.x, you have to use a u before the string: s = u'Hallo World' Ok. So, let's go back to my first question: s = u'Hallo World' is unicode in python 2.x - ok s = 'Hallo World' how is encoded? I am not 100% sure, but I think it depends on the encoding of your source file or the coding you specify. See PEP 263 http://www.python.org/dev/peps/pep-0263/ Well, the problem comes, i.e when i'm getting a string from an HTML form with POST. I don't and can't know the encoding, right? It depends on browser. Right, but you can do something about it. Tell the browser, which encoding you are going to accept: form ... accept-charset=UTF-8 ... /form - Patrick -- -- AnotherNetFellow Email: anothernetfel...@gmail.com ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encoding
Giorgio wrote:
>> Depends on your python version. If you use python 2.x, you have to use a u before the string:
>> s = u'Hallo World'
> Ok. So, let's go back to my first question:
> s = u'Hallo World' is unicode in python 2.x - ok
> s = 'Hallo World' how is encoded?

Since it's a quoted literal in your source code, it's encoded by your text editor when it saves the file, and you tell Python which encoding it was by the second line of your source file, right after the shebang line.

A sequence of bytes in an html file should have its encoding identified by the tag at the top of the html file. And I'd *guess* that on a form result, the encoding can be assumed to match that of the html of the form itself.

DaveA
Re: [Tutor] Encoding
Ok. So, how do you encode .py files? UTF-8?
Re: [Tutor] Encoding
Oops, I have another update:

    string = u"blabla"

This is unicode, ok. Unicode UTF-8?

Thank you
Re: [Tutor] Encoding
Giorgio, 03.03.2010 18:28:
> string = u"blabla"
> This is unicode, ok. Unicode UTF-8?

No, not UTF-8. Unicode. You may want to read this: http://www.amk.ca/python/howto/unicode

Stefan
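The distinction drawn above can be sketched in a few lines. This is only an illustration and it assumes Python 3, where every str is Unicode and encoded data is bytes, rather than the Python 2.x used in the thread; the variable names are made up for the example.

```python
# Python 3 sketch: a Unicode string is a sequence of code points;
# UTF-8, Latin-1 and UTF-16 are different byte encodings of that string.
text = "ciao \u00e8 ciao"              # U+00E8 is the letter e with grave accent

as_utf8 = text.encode("utf-8")         # the accented letter becomes two bytes: C3 A8
as_latin1 = text.encode("latin-1")     # the accented letter becomes one byte: E8
as_utf16 = text.encode("utf-16-le")    # every character here takes two bytes

print(len(text), len(as_utf8), len(as_latin1), len(as_utf16))

# Decoding each byte string with its own codec gives back the same Unicode text.
round_trip_ok = (as_utf8.decode("utf-8") == text
                 and as_latin1.decode("latin-1") == text
                 and as_utf16.decode("utf-16-le") == text)
```

So "Unicode" names the abstract text, while UTF-8 is one of several ways to write it down as bytes.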
Re: [Tutor] Encoding
Please let me post the third update O_o. You can forget the other 2, I'll put them into this email.

---
>>> s = "ciao è ciao"
>>> print s
ciao è ciao
>>> s.encode('utf-8')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    s.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 5: ordinal not in range(128)
---

I am getting more and more confused.

I was coding in PHP and was saving some strings in the DB. Was using utf8_encode to encode them before sending to the utf8_unicode_ci table. Ok, the result was that strings were double encoded. To fix that I simply removed the utf8_encode() function and put the raw data in the database (that converts them in utf8). In other words, in PHP, I can encode a string multiple times:

$c = "giorgio è giorgio";
$c = utf8_encode($c); // this will work in an utf8 html page
$d = utf8_encode($c); // this won't work, will print a strange letter
$d = utf8_decode($d); // this will work. will print an utf8 string

Ok, now, the point is: you (and the manual) said that this line:

s = u"giorgio è giorgio"

will convert the string as unicode. But also said that the part between "" will be encoded with my editor BEFORE getting encoded in unicode by python. So please pay attention to this example:

My editor is working in UTF8. I create this:

c = "giorgio è giorgio" // This will be an UTF8 string because of the file's encoding
d = unicode(c) // This will be an unicode string
e = c.encode() // How will be encoded this string? If PY is working like PHP this will be an utf8 string.

Can you help me?

Thankyou VERY much

Giorgio
Re: [Tutor] Encoding
I'm sorry, it's utf8_unicode_ci that's confusing me.

So, UTF-8 is one of the most commonly used encodings. UTF stands for Unicode Transformation Format. UTF8 is, we can say, a type of unicode, right? And what about utf8_unicode_ci in mysql?

Giorgio
Re: [Tutor] Encoding
(Don't top-post. Put your response below whatever you're responding to, or at the bottom.)

Giorgio wrote:
> Ok. So, how do you encode .py files? UTF-8?

I personally use Komodo to edit my python source files, and tell it to use UTF8 encoding. Then I add an encoding line as the second line of the file. Many times I get lazy, because mostly my source doesn't contain non-ASCII characters. But if I'm copying characters from an email or other Unicode source, then I make sure both are set up. The editor will actually warn me if I try to save a file as ASCII with any 8 bit characters in it.

Note: unicode is 16 bit characters, at least in the CPython implementation (narrow builds; wide builds use 32 bits). UTF-8 is an 8 bit encoding of that Unicode, where there's a direct algorithm to turn 16 or even 32 bit Unicode into 8 bit characters. They are not the same, although some people use the terms interchangeably.

Also note: An 8 bit string has no inherent meaning, until you decide how to decode it into Unicode. Doing explicit decodes is much safer, rather than assuming some system defaults. And if it happens to contain only 7 bit characters, it doesn't matter what encoding you specify when you decode it. Which is why all of us have been so casual about this.
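The remark that 7 bit data decodes the same way whatever encoding you name can be checked with a small sketch. It assumes Python 3 and a handful of ASCII-compatible codecs picked for illustration (UTF-16 and similar encodings would not behave this way).

```python
# Pure-ASCII bytes decode identically under any ASCII-compatible codec.
ascii_only = b"Hallo World"
codecs_to_try = ("ascii", "utf-8", "latin-1", "cp1252")
results = {name: ascii_only.decode(name) for name in codecs_to_try}
all_same = len(set(results.values())) == 1

# As soon as a byte above 0x7F appears, the chosen codec matters.
mixed = b"ciao \xe8 ciao"
as_latin1 = mixed.decode("latin-1")   # 0xE8 read as e with grave accent
as_cp437 = mixed.decode("cp437")      # the same byte read as a Greek capital phi
codec_matters = as_latin1 != as_cp437
```

This is why being casual about encodings only works until the first non-ASCII byte shows up.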
Re: [Tutor] Encoding
On 3 March 2010 20:44, Giorgio anothernetfel...@gmail.com wrote:
> >>> s = "ciao è ciao"
> >>> print s
> ciao è ciao
> >>> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<pyshell#2>", line 1, in <module>
>     s.encode('utf-8')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 5: ordinal not in range(128)

It is confusing, but once you understand how it works it makes sense.

You start with an 8bit string, so you will want to *decode* it to a unicode string.

>>> s = "ciao è ciao"
>>> us = s.decode('latin-1')
>>> us
u'ciao \xe8 ciao'
>>> us2 = s.decode('iso-8859-1')
>>> us2
u'ciao \xe8 ciao'

Greets
Sander
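Why does calling encode() produce a *decode* error? In Python 2, calling encode() on a byte string first decodes it implicitly with the default 'ascii' codec, and that hidden step is what fails. The sketch below spells the hidden step out explicitly. It assumes Python 3, where the step has to be written by hand, and the byte string is assumed to be Latin-1 as in the traceback above.

```python
raw = b"ciao \xe8 ciao"   # Latin-1 bytes, as the traceback above suggests

# The implicit step Python 2 performed: decode as ASCII before encoding.
try:
    raw.decode("ascii").encode("utf-8")
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start           # index of the first non-ASCII byte

# The explicit, correct route: decode with the real source encoding,
# then encode to the target encoding.
text = raw.decode("latin-1")
utf8_bytes = text.encode("utf-8")
```

Naming the source encoding explicitly removes the dependence on any default.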
Re: [Tutor] Encoding
On 3 March 2010 22:41, Sander Sweers sander.swe...@gmail.com wrote:
> It is confusing but once understand how it works it makes sense.

I remembered Kent explained it very clearly in [1].

Greets
Sander

[1] http://mail.python.org/pipermail/tutor/2009-May/068920.html
Re: [Tutor] Encoding
On Wed, 3 Mar 2010 16:32:01 +0100 Giorgio anothernetfel...@gmail.com wrote:
> Uff, encoding is a very painful thing in programming.

For sure, but it's true for any kind of data, not only text :-) Think of music or image *formats*. The issue is a bit obscured for text by the use of the mysterious, _cryptic_ (!), word "encoding".

When editing an image using a software tool, there is a live representation of the image in memory (say, a plain 2D pixel array), which is probably what the developer found most practical for image processing. [text processing in python: unicode string type]

When the job is finished, you can choose between various formats (png, gif, jpeg...) to save and/or transfer it. [text: utf-8/16/32, iso-8859-*, ascii...]

Conversely, if you want to edit an existing image, the software needs to convert back from the file format into its internal representation; the format needs to be indicated in the file, or by the user, or guessed.

The only difference with text is that there is no builtin image or sound representation _type_ in python -- only because image and sound are domain-specific data while text is needed everywhere.

Denis
--
la vita e estrany
spir.wikidot.com
Re: [Tutor] Encoding
On Wed, 3 Mar 2010 20:44:51 +0100 Giorgio anothernetfel...@gmail.com wrote:
> Please let me post the third update O_o. You can forgot other 2, i'll put them into this email.
> ---
> >>> s = "ciao è ciao"
> >>> print s
> ciao è ciao
> >>> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<pyshell#2>", line 1, in <module>
>     s.encode('utf-8')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 5: ordinal not in range(128)
> ---
> I am getting more and more confused.

What you enter at the terminal prompt is text, encoded in a format (ascii, latin*, utf*,...) that probably depends on your system locale. As this format is always a sequence of bytes, python stores it as a plain str:

>>> s = "ciao è ciao"
>>> s,type(s)
('ciao \xc3\xa8 ciao', <type 'str'>)

My system is configured for utf8. c3-a8 is the repr of 'è' in utf8. It needs 2 bytes because of the rules of utf8 itself. Right?

To get a python unicode string, it must be decoded from its format, for me utf8:

>>> u = s.decode("utf8")
>>> u,type(u)
(u'ciao \xe8 ciao', <type 'unicode'>)

e8 is the unicode code for 'è' (decimal 232). You can check that in tables. It needs only one byte here because 232 < 255.

> [comparison with php]
> Ok, now, the point is: you (and the manual) said that this line:
> s = u"giorgio è giorgio"
> will convert the string as unicode.

Yes and no: it will convert it *into* a unicode string, in the sense of a python representation for universal text. When seeing u"...", python will automagically *decode* the part in "...", taking as source format the one you indicate in a pseudo-comment at the top of your code file, eg:

# coding: utf8

Else I guess the default is the system's locale format? Or ascii? Someone knows?

So, in my case u"giorgio è giorgio" is equivalent to "giorgio è giorgio".decode("utf8"):

>>> u1 = u"giorgio è giorgio"
>>> u2 = "giorgio è giorgio".decode("utf8")
>>> u1,u2
(u'giorgio \xe8 giorgio', u'giorgio \xe8 giorgio')
>>> u1 == u2
True

> But also said that the part between "" will be encoded with my editor BEFORE getting encoded in unicode by python.

will be encoded with my editor BEFORE getting encoded in unicode by python
--> will be encoded *by* my editor BEFORE getting *decoded* *into* unicode by python

> So please pay attention to this example:
> My editor is working in UTF8. I create this:
> c = "giorgio è giorgio" // This will be an UTF8 string because of the file's encoding

Right.

> d = unicode(c) // This will be an unicode string
> e = c.encode() // How will be encoded this string? If PY is working like PHP this will be an utf8 string.

Have you tried it?

>>> c = "giorgio è giorgio"
>>> d = unicode(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

Now, tell us why! (the answer is below *)

> Can you help me?
> Thankyou VERY much
> Giorgio

Denis

(*) You don't tell which format the source string is encoded in. By default, python uses ascii (I know, it's stupid), whose max code is 127. So, 'è' is not accepted. Now, if I give a format, all works fine:

>>> d = unicode(c,"utf8")
>>> d
u'giorgio \xe8 giorgio'

Note: unicode(c,format) is an alias for c.decode(format):

>>> c.decode("utf8")
u'giorgio \xe8 giorgio'

la vita e estrany
spir.wikidot.com
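The PHP double-encoding experience mentioned in the question can be reproduced as a sketch. PHP's utf8_encode() treats its input as ISO-8859-1 and converts it to UTF-8, so applying it to text that is already UTF-8 encodes it a second time. The code below imitates that in Python 3 (an assumption; the thread itself uses Python 2), and the variable names are illustrative only.

```python
original = "giorgio \u00e8 giorgio"

# Correct: encode the Unicode text once.
once = original.encode("utf-8")

# Double encoding: the UTF-8 bytes are misread as Latin-1 and encoded again.
twice = once.decode("latin-1").encode("utf-8")

# Undoing it (the utf8_decode step): reverse the mistaken second pass.
repaired = twice.decode("utf-8").encode("latin-1")

print(once)    # the accented letter is the two bytes C3 A8
print(twice)   # each of those two bytes has itself become two bytes
```

Python makes the mistake harder to commit silently, because bytes and text are separate types and every conversion names its codec.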
Re: [Tutor] Encoding question
On Wed, Sep 9, 2009 at 5:06 AM, Oleg Oltar oltarase...@gmail.com wrote:
> Hi!
> One of my tests returned following text ()
> The test:
> from django.test.client import Client
> c = Client()
> resp = c.get("/")
> resp.content
>
> In [25]: resp.content
> Out[25]: '\r\n\r\n\r\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r\n\r\n<html xmlns="http://www.w3.org/1999/xhtml">\r\n <head>\r\n <meta http-equiv="content-type" content="text/html; charset=utf-8" />\r\n \r\n \n<title>Japanese innovation | \xd0\xaf\xd0\xbf\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xb8\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8</title>\n\r\n
> <snip>
> Is there a way I can convert it to normal readable text? (I need for example to find a string of text in this response to check if my test case Pass or failed)

resp.content.decode('string_escape') will convert it to encoded bytes. Then another decode() with the correct encoding will get you Unicode. I'm not sure what the correct encoding is for the second decode(), most likely one of 'utf-8', 'utf_16_le' or 'utf_16_be'.

Kent
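As a sketch of the "decode, then search" idea: the bytes in the title above happen to be valid UTF-8, which matches the charset named in the page's own meta tag. The example assumes Python 3 and assumes the response body is a real byte string (so the \x sequences are just how the interpreter displays it, and no string_escape step is needed); the `content` value is a shortened stand-in for resp.content.

```python
# Shortened stand-in for resp.content: the <title> element from the output above.
content = (b"<title>Japanese innovation | "
           b"\xd0\xaf\xd0\xbf\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x8f "
           b"\xd0\xb8\xd0\xbd\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xb0\xd1\x86\xd0\xb8\xd0\xb8"
           b"</title>")

# Decode with the charset the page declares, then search in Unicode text.
text = content.decode("utf-8")
expected = "\u042f\u043f\u043e\u043d\u0438\u044f"   # the Russian word for Japan
found = expected in text

print(text)
```

A test can then assert on `found` instead of comparing escaped byte sequences by eye.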
Re: [Tutor] Encoding and Decoding
Carlos wrote:
> The genetic algorithm that I'm using (GA) generates solutions for a given problem, expressed in a list; this list is composed of integers. Every element in the list takes 8 integers. It is a little messy, but this is because
> List [0] = Tens X position
> List [1] = Units X position
> List [2] = Decimals X position
> List [3] = If than 5 the number is negative, else is positive
> Then if the result is List = [6, 1, 2, 3] the X position equals -612.3. This is the same for the Y position. If there are 10 elements the list is going to be 80 integers long and if there are 100 elements, well you get a very long list...
> With this in mind my question would be, how can I keep track of this information? I mean how can I assign this List positions to each element? This is needed because this is going to be a long list and the GA needs to evaluate the position of each element with respect to the position of the other elements. So it needs to know that certain numbers are related to certain element and it needs to have access to the size, level, name and parent information... I hope that this is clear enough.

I will assume there is a good reason for storing the coordinates in this form...

Do the numbers have to be all in a single list? I would start by breaking it up into lists of four, so if you have 10 elements you would have a list of 20 small lists. It might make sense to pair the x and y lists so you have a list of 10 lists of 2 lists of 4 numbers, e.g.

[ [ [6, 1, 2, 3], [7, 2, 8, 4] ], ...]

Another thing to consider is whether you might want to make a class to hold the coordinate values; then you could refer to x.tens, x.units, x.decimal, x.sign by name.

If you need a single list for the GA to work, one alternative would be to make converters between the nested representation and the flat one. Alternately you could wrap the list in a class which provides helpful accessors.

HTH
Kent
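The converters suggested above might look like the following sketch. The function names `to_nested` and `to_flat` are hypothetical, and the code only regroups the integers (four per coordinate, x then y); it deliberately does not try to interpret the digits as a number.

```python
def to_nested(flat, digits=4):
    """Group a flat list as [[x_digits, y_digits], ...], one pair per element."""
    per_element = 2 * digits
    if len(flat) % per_element:
        raise ValueError("list length must be a multiple of %d" % per_element)
    return [[flat[i:i + digits], flat[i + digits:i + per_element]]
            for i in range(0, len(flat), per_element)]


def to_flat(nested):
    """Inverse of to_nested: concatenate every digit back into one list."""
    return [n for element in nested for coord in element for n in coord]


# Two elements, eight integers each (made-up sample values).
flat = [6, 1, 2, 3, 7, 2, 8, 4, 0, 5, 0, 9, 1, 1, 1, 1]
nested = to_nested(flat)
x_of_first, y_of_first = nested[0]
```

The GA can keep working on the flat list while the rest of the program reads `nested[i]` to get at element i.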
Re: [Tutor] Encoding and XML troubles
William O'Higgins Witteman wrote:
> I've been struggling with encodings in my XML input to Python programs.
> Here's the situation - my program has no declared encoding, so it defaults to ASCII. It's written in Unicode, but apparently that isn't confusing to the parser. Fine by me. I import some XML, probably encoded in the Windows character set (I don't remember what that's called now). I can read it for the most part - but it throws exceptions when it hits accented characters (some data is being input by French speakers). I am using ElementTree for my XML parsing
> What I'm trying to do is figure out what I need to do to get my program to not barf when it hits an accented character. I've tried adding an encoding line as suggested here: http://www.python.org/dev/peps/pep-0263/
> What these do is make the program fail to parse the XML at all. Has anyone encountered this? Suggestions? Thanks.

As Luke says, the encoding of your program has nothing to do with the encoding of the XML or the types of data your program will accept. PEP 263 only affects the encoding of string literals in your program.

It sounds like your XML is not well-formed. XML files can have an encoding declaration *in the XML*. If it is not present, the file is assumed to be in UTF-8 encoding. If your XML is in Cp1252 but lacks a correct encoding declaration, it is not valid XML because the Cp1252 characters are not valid UTF-8.

Try including the line
<?xml version="1.0" encoding="windows-1252"?>
or
<?xml version="1.0" encoding="Cp1252"?>
as the first line of the XML. (windows-1252 is the official IANA-registered name for Cp1252; I'm not sure which name will actually work correctly.)

Kent
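The effect of the encoding declaration can be tried out with a short sketch. It assumes Python 3 and the standard library's xml.etree.ElementTree; the two documents are made-up samples containing one Cp1252 byte (0xE9, an accented e).

```python
import xml.etree.ElementTree as ET

# The same Cp1252-encoded content, with and without an encoding declaration.
declared = b'<?xml version="1.0" encoding="windows-1252"?><name>Andr\xe9</name>'
undeclared = b'<name>Andr\xe9</name>'

# With the declaration, the parser decodes the byte and returns Unicode text.
name = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8 and rejects the stray byte.
try:
    ET.fromstring(undeclared)
    undeclared_parsed = True
except ET.ParseError:
    undeclared_parsed = False
```

If the XML cannot be fixed at the source, decoding the bytes yourself and passing a str to the parser is another option, but that goes beyond what the thread discusses.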
Re: [Tutor] Encoding and XML troubles
For what it's worth, the vast majority of the XML out there (especially if you're parsing RSS feeds, etc.) is written by monkeys and is totally ill-formed. It seems the days of 'it looked OK in my browser' are still here.

To find out if it's your app or the XML, you could try running the XML through a validating parser. There are also various tools out there which might be able to parse the XML anyway -- xmllint, I believe, can do this.

Dustin (not by *any* stretch an expert on XML *or* Unicode)
Re: [Tutor] Encoding and XML troubles
Inputting XML into a Python program has nothing to do with what encoding the python source is in. So it seems to me that that particular PEP doesn't apply in this case at all. I'm guessing that the ElementTree module has an option to use Unicode input.
Re: [Tutor] encoding text in html
submits: We\xe2\x80\x99re pretty sur

this is how it is stored in postgres
please help me out
thanks

----- Original Message ----
From: anil maran [EMAIL PROTECTED]
To: tutor@python.org
Sent: Wednesday, September 13, 2006 12:14:10 AM
Subject: encoding text in html

i was trying to display some text it is in utf-8 in postgres and when it is displayed in firefox and ie, it gets displayed as some symbols with 4 numbers in a box or so even for ' apostrophe
please tell me how to display this properly
i try title.__str__ or title.__repr__ both dont work
Re: [Tutor] encoding text in html
「ひぐらしのなく頃に」30秒TVCF風ver.0.1

this is how it is getting displayed in browser
Re: [Tutor] encoding text in html
anil maran wrote:
> 「ひぐらしのなく頃に」30秒TVCF風ver.0.1 http://youtube.com/?v=0WmeTRcAiec
> this is how it is getting displayed in browser

I'm pretty sure that is not how We\xe2\x80\x99re displays; can you show an example of the same text as it is stored and as it displays?

Kent
Re: [Tutor] encoding text in html
anil maran wrote:
> i was trying to display some text it is in utf-8 in postgres and when it is displayed in firefox and ie, it gets displayed as some symols with 4numbers in a box or so even for ' apostrophe
> please tell me how to display this properly
> i try title.__str__ or title.__repr__ both dont work

Do you have the page encoding set to utf-8 in Firefox? You can do this with View / Character Encoding as a test. If it displays correctly when you set the encoding then you should include a meta tag in the HTML that sets the charset. Put this in the head of the HTML:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Kent
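Putting the meta tag advice together with the stored bytes from earlier in the thread gives the following sketch. It assumes Python 3 and assumes the database hands back UTF-8 bytes; the page template and variable names are made up for illustration.

```python
# Bytes as stored in the database (taken from the earlier message, UTF-8).
stored = b"We\xe2\x80\x99re"
title = stored.decode("utf-8")       # Unicode text with a right single quote

# A minimal page that declares its own encoding in the head.
page = (
    "<html><head>"
    '<meta http-equiv="content-type" content="text/html; charset=utf-8" />'
    "<title>%s</title>"
    "</head><body></body></html>"
) % title

# Encode with the same charset the meta tag promises before sending it out.
body = page.encode("utf-8")
```

The point is that the declared charset and the bytes actually sent must agree; str() or repr() on the title only changes how Python displays it, not how the browser interprets the page.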
Re: [Tutor] encoding text in html
On Wed, 13 Sep 2006, anil maran wrote:
> i was trying to display some text it is in utf-8 in postgres and when it is displayed in firefox and ie, it gets displayed as some symols with 4numbers in a box or so even for ' apostrophe
> please tell me how to display this properly
> i try title.__str__

I'm assuming that you're dynamically generating some HTML document. If so, have you declared the document encoding in the HTML file to be utf-8? See: http://www.joelonsoftware.com/articles/Unicode.html

Do you have a small sample of the HTML file that's being generated? One of us here may want to inspect it to make sure you really are generating UTF-8 output. You may also want to show the Python code you've written to generate the output. Try to give us enough information so we can attempt to reproduce what you're seeing.

Good luck!
Re: [Tutor] encoding
Jose P wrote:
> watch this example:
> >>> a=['lula', 'cação']
> >>> print a
> ['lula', 'ca\xc3\xa7\xc3\xa3o']
> >>> print a[1]
> cação
> when i print the list the special characters are not printed correctly!

When you print a list, it uses repr() to format the contents of the list; when you print an item directly, str() is used. For a string containing non-ascii characters, the results are different.

> But if i print only the list item that has the special charaters it runs OK.
> How do i get list print correctly?

You will have to do the formatting yourself. A simple solution might be

for x in a:
    print x

If you want exactly the list formatting you have to work harder. Try something like

"['" + "', '".join([str(x) for x in a]) + "']"

Kent
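The repr-versus-str difference and the join-based workaround can be sketched as follows. This assumes Python 3, where printing a list already shows non-ASCII characters, so ascii() is used here as a stand-in for the escaped output that Python 2 produced (Python 2 showed the UTF-8 bytes, so the exact escapes differ).

```python
a = ["lula", "ca\u00e7\u00e3o"]

# Escaped form, comparable to what printing the list gave in Python 2.
escaped = ascii(a)

# Formatting each item with str() and adding the list punctuation by hand.
formatted = "['" + "', '".join(str(x) for x in a) + "']"

print(escaped)
print(formatted)
```

The hand-built version breaks if an item itself contains a quote character, so it is only suitable for display, not for producing something to parse back.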
Re: [Tutor] encoding
kakada wrote:
> LookupError: unknown encoding: ANSI
> so what is the correct way to do it?

stringinput.encode('latin_1') works for me. Do a Google search for Python encodings, and you will find what the right names for the encodings are. http://docs.python.org/lib/standard-encodings.html

Hugo
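One way to check whether a name is a codec Python knows is to ask the codecs module, as in this sketch. It assumes Python 3; `canonical_name` is a hypothetical helper, and "no-such-codec" is a deliberately invalid label. (What Windows calls "ANSI" is frequently cp1252, but that depends on the system, so it is not assumed here.)

```python
import codecs


def canonical_name(label):
    """Return Python's canonical name for a codec label, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


latin = canonical_name("latin_1")
windows = canonical_name("windows-1252")
unknown = canonical_name("no-such-codec")

print(latin, windows, unknown)
```

Looking the name up first gives a clearer failure than a LookupError raised in the middle of an encode() call.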