Re: [Tutor] String encoding
> Think about it this way... if I gave you a block of data as hex bytes: 240F91BC03...FF90120078CD45 and then asked you whether that was a bitmap image or a sound file or something else, how could you tell? It's just *bytes*, it could be anything.

Yes, but if you give me data and then tell me it is a sound file then I might be able to reverse engineer or reconstruct it. I know what the character does/should look like. I just need the equivalent to the ASCII table for the various encodings; once I have the table I can compare different characters at \311 and see if they are the correct character. I have not been able to find an encoding table (other than ASCII).

Ramit

Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology 712 Main Street | Houston, TX 77002 work phone: 713 - 216 - 5423 This email is confidential and subject to important disclaimers and conditions including on offers for the purchase or sale of securities, accuracy and completeness of information, viruses, confidentiality, legal privilege, and legal entity disclaimers, available at http://www.jpmorgan.com/pages/disclosures/email.
___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] String encoding
On Thu, Aug 25, 2011 at 7:07 PM, Prasad, Ramit ramit.pra...@jpmorgan.com wrote:
> Nice catch! Yeah, I am stuck on the encoding mechanism as well. I know how to encode/decode... but not what encoding to use. Is there a reference that I can look up to find what encoding that would correspond to? I know what the character looks like if that helps. I know that Python does display the correct character sometimes, but not sure when or why.

In this case, the encoding is almost certainly latin-1. I know that from playing around at the interactive interpreter, like this:

    >>> s = 'M\xc9XICO'
    >>> print s.decode('latin-1')
    MÉXICO

If you want to see charts of various encodings, Wikipedia has a bunch. For instance, the Latin-1 encoding is here: http://en.wikipedia.org/wiki/ISO/IEC_8859-1 and UTF-8 is here: http://en.wikipedia.org/wiki/Utf-8

As the other respondents have said, it's really hard to figure this out just in code. The chardet module mentioned by Steven D'Aprano is probably the best bet if you really *have* to guess the encoding of an arbitrary sequence of bytes, but it is much, much better to actually know the encoding of your inputs.

Good luck!

-- Jerry
Re: [Tutor] String encoding
> In this case, the encoding is almost certainly latin-1. I know that from playing around at the interactive interpreter, like this:
>
>     >>> s = 'M\xc9XICO'
>     >>> print s.decode('latin-1')
>     MÉXICO
>
> If you want to see charts of various encodings, wikipedia has a bunch. For instance, the Latin-1 encoding is here: http://en.wikipedia.org/wiki/ISO/IEC_8859-1 and UTF-8 is here: http://en.wikipedia.org/wiki/Utf-8

Yep, it is. Thanks, those charts are exactly what I wanted! Now I have another question. What is the difference between what print shows and what the interpreter shows?

    >>> print s.decode('latin-1')
    MÉXICO
    >>> s.decode('latin-1')
    u'M\xc9XICO'
    >>> print repr(s)
    'M\xc9XICO'
    >>> repr(s)
    "'M\\xc9XICO'"

Ramit
Re: [Tutor] String encoding
Prasad, Ramit wrote:
> > Think about it this way... if I gave you a block of data as hex bytes: 240F91BC03...FF90120078CD45 and then asked you whether that was a bitmap image or a sound file or something else, how could you tell? It's just *bytes*, it could be anything.
>
> Yes, but if you give me data and then tell me it is a sound file then I might be able to reverse engineer or reconstruct it. I know what the character does/should look like. I just need the equivalent to the ASCII table for the various encodings; once I have the table I can compare different characters at \311 and see if they are the correct character. I have not been able to find an encoding table (other than ASCII).

In practice, you can often guess the encoding by trying the most common ones (such as Latin-1 and UTF-8) and seeing if the strings you get make sense. But note that more than one encoding may give sensible results for a specific string:

    >>> b = 'M\311XICO'  # byte-string
    >>> print b.decode('latin-1')
    MÉXICO
    >>> print b.decode('iso 8859-9')  # Turkish
    MÉXICO

So was M\311XICO encoded using the Latin-1 or Turkish encoding, or something else? There is no way to tell. Many encodings overlap.

If you have arbitrary byte-strings, and no context to tell what makes sense, then all bets are off. Just because something *can* be decoded doesn't make it meaningful:

    >>> b = '...\xf7...'
    >>> print b.decode('macroman')
    ...˜...
    >>> print b.decode('latin-1')
    ...÷...

Which is the right encoding to use and which string is intended?

So guessing can sometimes work, but guesses can be wrong because encodings overlap. In general, you must know the encoding to be sure. But if you have to guess, try to guess using the largest byte-string that you can.

Python 2.7 comes with 108 encodings: http://docs.python.org/library/codecs.html#standard-encodings

Since anyone can define their own encoding, there is no upper limit to the number of encodings, and no promise that Python will include them all. There are even two joke encodings, invented for April Fools' Day, that use nine-bit nonets instead of eight-bit octets (bytes): UTF-9 and UTF-18.

-- Steven
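The overlap Steven describes can be checked directly. The following is only a sketch in Python 3 syntax (the thread itself uses Python 2, where `'M\311XICO'` is already a byte string); the variable names are made up for illustration.

```python
# Python 3 sketch: the same bytes decode "successfully" under several codecs.
data = b"M\xc9XICO"  # the byte string from the thread

latin1 = data.decode("latin-1")
turkish = data.decode("iso8859-9")
print(latin1 == turkish)  # True: both codecs map byte 0xC9 to the same character

# A single byte can also map to different characters in different encodings.
ambiguous = b"\xf7"
as_macroman = ambiguous.decode("mac_roman")  # small tilde, U+02DC
as_latin1 = ambiguous.decode("latin-1")      # division sign, U+00F7
print(hex(ord(as_macroman)), hex(ord(as_latin1)))  # 0x2dc 0xf7
```

Both decodes of `data` succeed and agree, so nothing in the bytes themselves says which encoding was used; for `b"\xf7"` both decodes succeed and disagree, so only context can say which result was intended.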
Re: [Tutor] String encoding
On 08/26/2011 11:49 AM, Prasad, Ramit wrote:
> <snip>
> Yep, it is. Thanks those charts are exactly what I wanted! Now I have another question. What is the difference between what print shows and what the interpreter shows?
>
>     >>> print s.decode('latin-1')
>     MÉXICO

The decoded characters are a Unicode string. Python prints that string by encoding it according to whatever sys.stdout is defaulted to. If that matches your actual terminal, then you see it properly.

>     >>> s.decode('latin-1')
>     u'M\xc9XICO'

Here, because you don't assign it to anything, the interpreter is printing a repr() of the object.

>     >>> print repr(s)
>     'M\xc9XICO'

Here your code is doing the same thing, but explicitly this time.

>     >>> repr(s)
>     "'M\\xc9XICO'"

Here, the repr() is created (which is a string containing single quotes), but then you don't print it, you just leave it. So the interpreter shows you the repr() of that object, enclosing it in double quotes for simplicity.

-- DaveA
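For readers following along on Python 3, here is a rough equivalent of the session DaveA explains. It is a sketch, not a transcript: `ascii()` is used as a stand-in for Python 2's escaping `repr()`, and the variable names are hypothetical.

```python
# Python 3 sketch of "what print shows" versus "what the prompt echoes".
text = b"M\xc9XICO".decode("latin-1")   # a text (Unicode) string

shown_by_print = str(text)     # print() writes the characters themselves
escaped = ascii(text)          # quoted literal with non-ASCII shown as escapes,
                               # close to what Python 2's repr() produced
repr_of_repr = repr(escaped)   # repr of a string that itself contains quotes

print(escaped)        # 'M\xc9XICO'
print(repr_of_repr)   # "'M\\xc9XICO'"
```

The second line of output shows the doubling effect from the thread: the inner string already contains single quotes and a backslash, so its own repr switches to double quotes and escapes the backslash.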
[Tutor] String encoding
I have a string question for Python2. Basically I have two strings with non-ASCII characters and I would like to have a better understanding of what the escapes are from and how to possibly remove/convert/encode the string to something else. If the description of my intended action is vague it is because my intent at this point is vague until I understand the situation better.

' M\xc9XICO' and ' M\311XICO'

Ramit
Re: [Tutor] String encoding
On 25/08/11 15:36, Prasad, Ramit wrote:
> I have a string question for Python2. Basically I have two strings with non-ASCII characters and I would like to have a better understanding of what the escapes are from
>
> ' M\xc9XICO' and ' M\311XICO'

I don't know what they are from but they are both the same value, one in hex and one in octal.

    0xC9 == 0311

As for the encoding mechanisms I'm afraid I can't help there!

HTH

-- Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
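Alan's observation is easy to confirm at the prompt. A small sketch in Python 3 syntax, where octal integer literals are spelled `0o311` (Python 2 also accepted `0311`, as in the message above):

```python
# The hex escape \xc9 and the octal escape \311 name the same byte value.
print(0xC9 == 0o311 == 201)        # True

hex_form = b"M\xc9XICO"
octal_form = b"M\311XICO"
print(hex_form == octal_form)      # True
print(hex(0o311), oct(0xC9))       # 0xc9 0o311
```

So the two strings in the original question are the same sequence of bytes written two ways; the escape style says nothing about which encoding produced the byte.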
Re: [Tutor] String encoding
> I don't know what they are from but they are both the same value, one in hex and one in octal. 0xC9 == 0311 As for the encoding mechanisms I'm afraid I can't help there!

Nice catch! Yeah, I am stuck on the encoding mechanism as well. I know how to encode/decode...but not what encoding to use. Is there a reference that I can look up to find what encoding that would correspond to? I know what the character looks like if that helps. I know that Python does display the correct character sometimes, but not sure when or why.

Ramit
Re: [Tutor] String encoding
Prasad, Ramit wrote:
> > I don't know what they are from but they are both the same value, one in hex and one in octal. 0xC9 == 0311 As for the encoding mechanisms I'm afraid I can't help there!
>
> Nice catch! Yeah, I am stuck on the encoding mechanism as well. I know how to encode/decode...but not what encoding to use. Is there a reference that I can look up to find what encoding that would correspond to? I know what the character looks like if that helps. I know that Python does display the correct character sometimes, but not sure when or why.

In general, no. The same byte value (0xC9) could correspond to many different encodings. In general, you *must* know what the encoding is in order to tell how to decode the bytes.

Think about it this way... if I gave you a block of data as hex bytes: 240F91BC03...FF90120078CD45 and then asked you whether that was a bitmap image or a sound file or something else, how could you tell? It's just *bytes*, it could be anything.

All is not quite lost though. You could try decoding the bytes and see what you get, and see if it makes sense. Start with ASCII, Latin-1, UTF-8, UTF-16 and any other encodings in common use. (This would be like pretending the bytes were a bitmap, and looking at it, and trying to decide whether it looked like an actual picture or like a bunch of random pixels. Hopefully it wasn't meant to look like a bunch of random pixels.)

Web browsers such as Internet Explorer and Mozilla will try to guess the encoding by doing frequency analysis of the bytes. Mozilla's encoding guesser has been ported to Python: http://chardet.feedparser.org/

But any sort of guessing algorithm is just a nasty hack. You are always better off ensuring that you accurately know the encoding.

-- Steven
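The "try the common encodings and look at the result" approach could be sketched as below. This is Python 3 syntax, and `try_encodings` is a hypothetical helper written for this illustration, not something from the standard library or from chardet.

```python
def try_encodings(data, candidates=("ascii", "utf-8", "latin-1", "utf-16")):
    """Return {encoding name: decoded text} for every candidate that decodes."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results

results = try_encodings(b"M\xc9XICO")
for name, text in results.items():
    # ascii() keeps the output printable on any terminal
    print(name, ascii(text))
```

For these bytes ASCII and UTF-8 are ruled out, but both Latin-1 and UTF-16 decode without error, and only the Latin-1 result looks like a word. That is the point made above: a successful decode narrows the field but a human (or known context) still has to judge whether the text makes sense.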
Re: [Tutor] string encoding
On 06/18/10 14:21, Rick Pasotto wrote:
> > Remember, even if your terminal display is restricted to ASCII, you can still use Beautiful Soup to parse, process, and write documents in UTF-8 and other encodings. You just can't print certain strings with print.
>
> I can print the string fine. It's f.write(string_with_unicode) that fails with:
>
>     UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32: ordinal not in range(128)
>
> Shouldn't I be able to f.write() *any* 8bit byte(s)? repr() gives: u'Realtors\\xc2\\xae'
>
> BTW, I'm running python 2.5.5 on debian linux.

The FAQ explains half of it, except that in your case you should substitute "file object" for what it says about the terminal. Python plays it safe and does not implicitly encode a unicode string when writing into a file. If you have a unicode string and you want to .write() that unicode string to a file, you need to .encode() the string first, so:

    string_with_unicode = u'Realtors\xc2\xae'
    f.write(string_with_unicode.encode('utf-8'))

Otherwise, you can use the codecs module to wrap the file object:

    import codecs
    f = codecs.open('filename.txt', 'w', encoding='utf-8')
    f.write(string_with_unicode)  # now you can send unicode strings to f
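The two options above (encode before writing, or let the file object do the encoding) look like this in Python 3 syntax. This is a sketch only: it uses the built-in `open()` with an `encoding` argument in place of `codecs.open`, the file name is hypothetical, and the text is assumed to be "Realtors" followed by the registered-trademark sign.

```python
import os
import tempfile

text = "Realtors\u00ae"   # "Realtors" + registered-trademark sign

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.txt")  # hypothetical file name

    # Option 1: encode explicitly and write bytes.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))
    with open(path, "rb") as f:
        raw = f.read()

    # Option 2: let the file object encode on write.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "rb") as f:
        raw_again = f.read()

print(raw == raw_again == b"Realtors\xc2\xae")  # True
```

Either way the bytes on disk are the same; the only difference is where the encoding is named.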
Re: [Tutor] string encoding
Rick Pasotto wrote:
> <snip>
> I can print the string fine. It's f.write(string_with_unicode) that fails with:
>
>     UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32: ordinal not in range(128)
>
> Shouldn't I be able to f.write() *any* 8bit byte(s)? repr() gives: u'Realtors\\xc2\\xae'
>
> BTW, I'm running python 2.5.5 on debian linux.

You can write any 8 bit string. But you have a Unicode string, which is 16 or 32 bits per character. To write it to a file, it must be encoded, and the default encoder is ASCII. The cure is to encode it yourself, using the encoding that your spec calls for. I'll assume utf8 below:

    >>> name = u"Realtors\xc2\xae"
    >>> repr(name)
    "u'Realtors\\xc2\\xae'"
    >>> outfile = open("junk.txt", "w")
    >>> outfile.write(name)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)
    >>> outfile.write(name.encode("utf8"))
    >>> outfile.close()

DaveA
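The failure in DaveA's session can be reproduced without a file at all, since the error comes from the implicit ASCII encode. A sketch in Python 3 syntax, where the encode has to be requested explicitly:

```python
name = "Realtors\xc2\xae"   # the same two non-ASCII characters as in the session

try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# start is inclusive and end is exclusive, matching "position 8-9" in the message.
print(error.start, error.end)   # 8 10

encoded = name.encode("utf-8")  # an explicit encoding succeeds
print(encoded)                  # b'Realtors\xc3\x82\xc2\xae'
```

Note that the UTF-8 output is four bytes for the two trailing characters, because the string as given really holds the two characters U+00C2 and U+00AE rather than a single trademark sign.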
[Tutor] string encoding
I'm using BeautifulSoup to process a webpage. One of the fields has a unicode character in it. (It's the 'registered trademark' symbol.) When I try to write this string to another file I get this error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32: ordinal not in range(128) In the interpreter the offending string portion shows as: 'Realtors\xc2\xae'. How can I deal with this single string? The rest of the document works fine. -- Freedom can't be kept for nothing. If you set a high value on liberty, you must set a low value on everything else. -- Lucius Annaeus Seneca, 65 A.D. Rick Pasottor...@niof.nethttp://www.niof.net ___ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] string encoding
On 06/18/10 06:41, Rick Pasotto wrote:
> I'm using BeautifulSoup to process a webpage. One of the fields has a unicode character in it. (It's the 'registered trademark' symbol.) When I try to write this string to another file I get this error:
>
>     UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32: ordinal not in range(128)
>
> In the interpreter the offending string portion shows as: 'Realtors\xc2\xae'. How can I deal with this single string? The rest of the document works fine.

You need to tell BeautifulSoup the encoding of the HTML document. The encoding can be declared in any of these places:

- (preferred) externally, in the HTTP Content-Type header, e.g.:
      Content-Type: text/html; charset=utf-8
- in the HTML Content-Type declaration, e.g.:
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
- in the XML declaration, for XHTML documents parsed with an XML parser (hint: BeautifulSoup isn't an XML/XHTML parser), e.g.:
      <?xml version="1.0" encoding="utf-8"?>

However, BeautifulSoup will also use some heuristics to *guess* the encoding of a tag soup that doesn't have a proper encoding declaration. So, the most likely reason is this, from Beautiful Soup's FAQ: http://www.crummy.com/software/BeautifulSoup/documentation.html#Why can't Beautiful Soup print out the non-ASCII characters I gave it?

Why can't Beautiful Soup print out the non-ASCII characters I gave it?

If you're getting errors that say: "'ascii' codec can't encode character 'x' in position y: ordinal not in range(128)", the problem is probably with your Python installation rather than with Beautiful Soup. Try printing out the non-ASCII characters without running them through Beautiful Soup and you should have the same problem. For instance, try running code like this:

    latin1word = 'Sacr\xe9 bleu!'
    unicodeword = unicode(latin1word, 'latin-1')
    print unicodeword

If this works but Beautiful Soup doesn't, there's probably a bug in Beautiful Soup. However, if this doesn't work, the problem's with your Python setup. Python is playing it safe and not sending non-ASCII characters to your terminal. There are two ways to override this behavior.

1. The easy way is to remap standard output to a converter that's not afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.

    import codecs
    import sys
    streamWriter = codecs.lookup('utf-8')[-1]
    sys.stdout = streamWriter(sys.stdout)

   codecs.lookup returns a number of bound methods and other objects related to a codec. The last one is a StreamWriter object capable of wrapping an output stream.

2. The hard way is to create a sitecustomize.py file in your Python installation which sets the default encoding to ISO-Latin-1 or to UTF-8. Then all your Python programs will use that encoding for standard output, without you having to do something for each program. In my installation, I have a /usr/lib/python/sitecustomize.py which looks like this:

    import sys
    sys.setdefaultencoding("utf-8")

For more information about Python's Unicode support, look at Unicode for Programmers or End to End Unicode Web Applications in Python. Recipes 1.20 and 1.21 in the Python cookbook are also very helpful.

Remember, even if your terminal display is restricted to ASCII, you can still use Beautiful Soup to parse, process, and write documents in UTF-8 and other encodings. You just can't print certain strings with print.
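To see what the stream-writer wrapping in the FAQ's "easy way" actually does, here is a sketch in Python 3 syntax. It wraps an in-memory byte buffer instead of `sys.stdout` so the effect can be inspected, and it uses `codecs.getwriter`, which returns the same class as `codecs.lookup('utf-8')[-1]`.

```python
import codecs
import io

raw = io.BytesIO()                        # stands in for a byte-oriented output stream
writer = codecs.getwriter("utf-8")(raw)   # StreamWriter wrapping that stream

writer.write("Sacr\xe9 bleu!")            # text goes in ...
print(raw.getvalue())                     # ... UTF-8 bytes come out: b'Sacr\xc3\xa9 bleu!'
```

The wrapper encodes every string written to it before passing the bytes on, which is why the FAQ's remapped `sys.stdout` stops raising the 'ascii' codec error.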
Re: [Tutor] string encoding
On Fri, Jun 18, 2010 at 12:24:25PM +1000, Lie Ryan wrote:
> On 06/18/10 06:41, Rick Pasotto wrote:
> > I'm using BeautifulSoup to process a webpage. One of the fields has a unicode character in it. (It's the 'registered trademark' symbol.) When I try to write this string to another file I get this error:
> >
> >     UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32: ordinal not in range(128)
> >
> > In the interpreter the offending string portion shows as: 'Realtors\xc2\xae'. How can I deal with this single string? The rest of the document works fine.
>
> You need to tell BeautifulSoup the encoding of the HTML document. You can encode this information in either the:
>
> - (preferred) Encoding is specified externally from HTTP Header ContentType declaration, e.g.:
>       Content-Type: text/html; charset=utf-8
> - HTML ContentType declaration: e.g.
>       <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The document has:

    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

When I look at the document in vim and when I 'print' in python I see the two characters of an accented capital A and the circled 'r'.

>     latin1word = 'Sacr\xe9 bleu!'
>     unicodeword = unicode(latin1word, 'latin-1')
>     print unicodeword

    TypeError: decoding Unicode is not supported

> If this works but Beautiful Soup doesn't, there's probably a bug in Beautiful Soup. However, if this doesn't work, the problem's with your Python setup. Python is playing it safe and not sending non-ASCII characters to your terminal. There are two ways to override this behavior.
>
> 1. The easy way is to remap standard output to a converter that's not afraid to send ISO-Latin-1 or UTF-8 characters to the terminal.
>
>     import codecs
>     import sys
>     streamWriter = codecs.lookup('utf-8')[-1]
>     sys.stdout = streamWriter(sys.stdout)
>
> codecs.lookup returns a number of bound methods and other objects related to a codec. The last one is a StreamWriter object capable of wrapping an output stream.

Those four lines executed but I still get

    TypeError: decoding Unicode is not supported

> Remember, even if your terminal display is restricted to ASCII, you can still use Beautiful Soup to parse, process, and write documents in UTF-8 and other encodings. You just can't print certain strings with print.

I can print the string fine. It's f.write(string_with_unicode) that fails with:

    UnicodeEncodeError: 'ascii' codec can't encode characters in position 31-32: ordinal not in range(128)

Shouldn't I be able to f.write() *any* 8bit byte(s)? repr() gives: u'Realtors\\xc2\\xae'

BTW, I'm running python 2.5.5 on debian linux.

--
Making fun of born-again christians is like hunting dairy cows with a high powered rifle and scope. -- P.J. O'Rourke
Rick Pasotto r...@niof.net http://www.niof.net
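One possible reading of Rick's symptoms, offered as an assumption rather than a diagnosis from the thread: the page declares iso-8859-1 but the bytes are really UTF-8, so the two UTF-8 bytes of the trademark sign (0xC2 0xAE) were decoded as two separate Latin-1 characters, which would explain seeing an accented capital A followed by the circled "r". If that is what happened, the text can be repaired by reversing the wrong decode. A sketch in Python 3 syntax:

```python
garbled = "Realtors\xc2\xae"   # what the parser produced: two characters after "Realtors"

# Undo the mistaken Latin-1 decode to get the original bytes back,
# then decode them as UTF-8 (the assumed real encoding of the page).
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(garbled), len(repaired))   # 10 9
print(ascii(repaired))               # 'Realtors\xae'
```

The repaired string is one character shorter, ending in the single code point U+00AE. If the page really were Latin-1, this round trip would be the wrong thing to do, so it is worth checking the HTTP headers or the raw bytes first.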
[Tutor] String Encoding problem
Hey everyone,

I'm hoping someone here can help me solve an odd problem (bug?). I'm having trouble with string encoding, object deletion, and the xml.etree library. If this isn't the right list to be posting this question, please let me know. I'm new to Python and don't know of any other "help me" Python mailing lists. I have tried debugging this ad infinitum.

Anyway, at the bottom of this e-mail you will find the code of a python file. This is a gross over-simplification of my code, with little exception handling so that the errors are obvious. Running this interactively, if you finish off with 'del db', it exits fine and creates a skeleton xml file called 'db.xml' with text '<root />'. However, if you instead press CTRL-D, it throws an exception while quitting and then leaves an empty 'db.xml' which won't work. Can anyone here help me figure out why this is?

Stuff I've done: I've traced this down to the self.commit() call in __del__. The stacktrace and a few print statements injected into xml.etree lead me to the call 'root'.encode('us-ascii') throwing a LookupError on line 751 of xml.etree.ElementTree. This makes no sense to me, since it works fine normally.

Thank you very much. Any and all help or pointers are appreciated.

~Matt

db.py ###

    from xml.etree import ElementTree as ET
    import os

    class Database(object):
        def __init__(self, path):
            self.__dbpath = path    ## Path to the database
            self.load()

        def __del__(self):
            ## FIXME: Known bug:
            ##   del db at command line works properly
            ##   Ctrl-D, when there is no db file present, results in a LookupError
            ##   and empty xml file
            from StringIO import StringIO
            from traceback import print_exc
            trace = StringIO()
            try:
                print 5
                self.commit()
                print 7
            except Exception:
                print_exc(100, trace)
                print trace.getvalue()

        def load(self):
            if os.path.exists(self.__dbpath):
                self.root = ET.parse(self.__dbpath).getroot()
            else:
                self.root = ET.Element("root")

        def commit(self):
            ET.ElementTree(self.root).write(self.__dbpath)

    db = Database('db.xml')
Re: [Tutor] String Encoding problem
On Mon, 20 Apr 2009 10:46:47 -0400, Matt hellzfury+pyt...@gmail.com wrote:
> <snip>

Actually, it all runs well for me -- after the following modification:

    def __del__(self):
        ## FIXME: Known bug:
        ##   del db at command line works properly
        ##   Ctrl-D, when there is no db file present, results in a LookupError
        ##   and empty xml file
        try:
            print 5
            self.commit()
            print 7
        except Exception:
            raise

Notes:
* I don't know for what reason you needed such a complicated traceback construct.
* Before I made this modification, I indeed had a weird exception about StringIO.
* __del__() seems to do the contrary: it writes back to file through commit()???
* del db works fine, anyway.
* When I run without any db.xml, it properly creates one with text <root />.
* When I run with an empty db.xml, I have the following exception message:

    Traceback (most recent call last):
      File "xmlTree.py", line 29, in <module>
        db = Database('db.xml')
      File "xmlTree.py", line 10, in __init__
        self.load()
      File "xmlTree.py", line 24, in load
        self.root = ET.parse(self.__dbpath).getroot()
      File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 862, in parse
        tree.parse(source, parser)
      File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 587, in parse
        self._root = parser.close()
      File "/usr/lib/python2.5/xml/etree/ElementTree.py", line 1254, in close
        self._parser.Parse("", 1) # end of data
    xml.parsers.expat.ExpatError: no element found: line 2, column 0
    5
    Exception exceptions.AttributeError: AttributeError("'Database' object has no attribute 'root'",) in <bound method Database.__del__ of <__main__.Database object at 0xb7e78fec>> ignored

--
la vita e estrany
Re: [Tutor] String Encoding problem
From: spir denis.s...@free.fr
Date: Mon, 20 Apr 2009 12:22:59 -0500
To: Python Tutor tutor@python.org
Subject: Re: [Tutor] String Encoding problem

> On Mon, 20 Apr 2009 10:46:47 -0400, Matt hellzfury+pyt...@gmail.com wrote:
> > <snip>
>
> Actually, it all runs well for me -- after the following modification:
>
>     def __del__(self):
>         ## FIXME: Known bug:
>         ##   del db at command line works properly
>         ##   Ctrl-D, when there is no db file present, results in a LookupError
>         ##   and empty xml file
>         try:
>             print 5
>             self.commit()
>             print 7
>         except Exception:
>             raise

I must be missing something. I run the following code (in DB.py) without any other files in the current directory:

    from xml.etree import ElementTree as ET
    import os

    class Database(object):
        def __init__(self, path):
            self.dbpath = path    ## Path to the database
            self.load()

        def __del__(self):
            try:
                print 5
                self.commit()
                print 7
            except Exception:
                raise

        def load(self):
            if os.path.exists(self.dbpath):
                self.root = ET.parse(self.dbpath).getroot()
            else:
                self.root = ET.Element("root")

        def commit(self):
            ET.ElementTree(self.root).write(self.dbpath)

    db = Database('db.xml')

Output:

    5
    Exception LookupError: LookupError('unknown encoding: us-ascii',) in <bound method Database.__del__ of <__main__.Database object at 0x87870>> ignored

If you're not getting the same output, please let me know what your environment is. Perhaps this is an implementation difference across platforms.

> Notes:
> * I don't know for what reason you needed such a complicated traceback construct.

That was only to demonstrate the error. Without that, you see a LookupError without any trace.

> * Before I did this modif, I indeed had a weird exception about stringIO.

Top-level imports are not consistently available in __del__. That shouldn't be necessary with the code I have above.

> * __del__() seems to do the contrary: it writes back to file through commit()???

Yes, I know. In my actual code, there is a flag that is set when certain run-time conditions are met or when the user wants the DB to be saved on quit. Most of the time, however, modifications to the database need to be done in memory because they are not intended to be saved.

> * del db works fine, anyway
> * When I run without any bd.xml, it properly creates one with text <root />.
> * When I run with an ampty db.xml, I have the following exception message:
>
>     Traceback (most recent call last):
>       File "xmlTree.py", line 29, in <module>
>         db
Re: [Tutor] String Encoding problem
Matt wrote: Hey everyone, I'm hoping someone here can help me solve an odd problem (bug?). I'm having trouble with string encoding, object deletion, and the xml.etree library. If this isn't the right list to be posting this question, please let me know. I'm new to Python and don't know of any other help me Python mailing lists. I have tried debugging this ad-infinitem. Anyway, at the bottom of this e-mail you will find the code of a python file. This is a gross over-simplification of my code, with little exception handling so that the errors are obvious. Running this interactively, if you finish off with 'del db', it exits fine and creates a skeleton xml file called 'db.xml' with text 'root /'. However, if you instead CTRL-D, it throws at exception while quitting and then leaves an empty 'db.xml' which won't work. Can anyone here help me figure out why this is? Stuff I've done: I've traced this down to the self.commit() call in __del__. The stacktrace and a few print statements injected into xml.etree leads me to the call 'root'.encode('us-ascii') throwing a LookupError on line 751 of xml.etree.ElementTree. This makes no sense to me, since it works fine normally. The environment available to __del__ methods during program termination is wonky, and apparently not very consistent either. I can't say that I completely understand it myself, perhaps someone else can provide a better explanation for both of us, but some of the causes are described in the documentation: http://docs.python.org/reference/datamodel.html#object.__del__ What is your rationale for using __del__? Are you trying to force a 'commit()' call on Database instances when your program terminates -- in the case of an unhandled exception, for example? HTH, Marty Thank you very much. Any and all help or pointers are appreciated. 
> ~Matt
>
> db.py ###
> from xml.etree import ElementTree as ET
> import os
>
> class Database(object):
>     def __init__(self, path):
>         self.__dbpath = path  ## Path to the database
>         self.load()
>
>     def __del__(self):
>         ## FIXME: Known bug:
>         ##  del db at command line works properly
>         ##  Ctrl-D, when there is no db file present, results in a LookupError
>         ##    and empty xml file
>         from StringIO import StringIO
>         from traceback import print_exc
>         trace = StringIO()
>         try:
>             print 5
>             self.commit()
>             print 7
>         except Exception:
>             print_exc(100, trace)
>             print trace.getvalue()
>
>     def load(self):
>         if os.path.exists(self.__dbpath):
>             self.root = ET.parse(self.__dbpath).getroot()
>         else:
>             self.root = ET.Element('root')
>
>     def commit(self):
>         ET.ElementTree(self.root).write(self.__dbpath)
>
> db = Database('db.xml')

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
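If the goal really is to force a commit() when the program terminates, one option not raised so far in the thread is the standard atexit module, whose handlers run at normal interpreter shutdown (they do not run after os._exit() or a fatal signal). The following is only a minimal sketch under that assumption: it reuses the shape of the Database class above, drops __del__, uses a single underscore for the path attribute, and is written so it also runs on Python 3. Whether saving on every normal exit is the wanted behaviour is itself an assumption.

```python
import atexit
import os
from xml.etree import ElementTree as ET


class Database(object):
    """Same shape as the Database class above, but without __del__."""

    def __init__(self, path):
        self._dbpath = path  # path to the XML file
        self.load()
        # Assumption: the database should be saved at normal interpreter
        # exit. Registering the bound method also keeps this object alive
        # until exit.
        atexit.register(self.commit)

    def load(self):
        # Read the existing file, or start from an empty <root /> element.
        if os.path.exists(self._dbpath):
            self.root = ET.parse(self._dbpath).getroot()
        else:
            self.root = ET.Element('root')

    def commit(self):
        # Write the in-memory tree back to the file.
        ET.ElementTree(self.root).write(self._dbpath)
```

Because the handler is registered in __init__, every instance will be committed at exit; a flag checked inside commit() would be needed to make saving optional.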
Re: [Tutor] String Encoding problem
On Mon, Apr 20, 2009 at 10:46 AM, Matt hellzfury+pyt...@gmail.com wrote:
> Running this interactively, if you finish off with 'del db', it exits fine and creates a skeleton xml file called 'db.xml' with text '<root />'. However, if you instead press CTRL-D, it throws an exception while quitting and then leaves an empty 'db.xml', which won't work. Can anyone here help me figure out why this is?
>
> Stuff I've done: I've traced this down to the self.commit() call in __del__. The stack trace and a few print statements injected into xml.etree lead me to the call 'root'.encode('us-ascii') throwing a LookupError on line 751 of xml.etree.ElementTree. This makes no sense to me, since it works fine normally.

Please show the exact error message and stack trace when you post errors; it can be very helpful.

What you are doing with __del__ is unusual. A better way to ensure cleanup is to use a close() method which a client must call, or to use a context manager and a 'with' statement.

I think your code is failing because some module needed by the encode() call has already been unloaded before your __del__() method is called.

> Thank you very much. Any and all help or pointers are appreciated.

If you defined a close() method, you could write client code like this:

from contextlib import closing
with closing(Database('db.xml')) as db:
    # do something with db
# when this block exits db will be closed

It's also not too hard to make an openDatabase() function so you could write

with (openDatabase('db.xml')) as db:
    # etc

though that is not really a beginner challenge. Some notes and further pointers here:
http://personalpages.tds.net/~kent37/kk/00015.html

Kent
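The close() method and the openDatabase() function mentioned above could look roughly like this. It is a sketch, not code from the thread: the body of close() (simply calling commit()) and the implementation of openDatabase() with contextlib.contextmanager are assumptions, and the Database class is a trimmed copy of the one posted earlier with __del__ removed.

```python
import os
from contextlib import closing, contextmanager
from xml.etree import ElementTree as ET


class Database(object):
    def __init__(self, path):
        self._dbpath = path
        self.load()

    def load(self):
        # Read the existing file, or start from an empty <root /> element.
        if os.path.exists(self._dbpath):
            self.root = ET.parse(self._dbpath).getroot()
        else:
            self.root = ET.Element('root')

    def commit(self):
        ET.ElementTree(self.root).write(self._dbpath)

    def close(self):
        # Hypothetical close(): here it only saves, because this toy class
        # holds no other resources.
        self.commit()


@contextmanager
def openDatabase(path):
    """Yield a Database and close it when the 'with' block exits."""
    db = Database(path)
    try:
        yield db
    finally:
        # Runs on normal exit from the block and when an exception escapes.
        db.close()
```

Client code would then read `with openDatabase('db.xml') as db: ...`, and `with closing(Database('db.xml')) as db: ...` works with the same class. In both cases the save happens while the interpreter is still fully alive, which avoids the shutdown-time environment that __del__ runs in.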
Re: [Tutor] String Encoding problem
Sorry about that. Hopefully this is better:

In operator:

def __init__(self, path, saveDB=True, cleanUp=True):
    '''Constructor'''
    ## Calculate filesystem paths
    self.WORK_DIR = path + '.tmp'
    DB_PATH = path + '.xml'
    self.SAVE_DB = saveDB   ## finish(): Delete unnecessary files created by run?
    self.CLEANUP = cleanUp  ## finish(): Delete database at end of run?
    ## Make sure we have a working directory (exception on failed write)
    if not os.path.isdir(self.WORK_DIR):
        os.mkdir(self.WORK_DIR)
    self._db = DB.Database(DB_PATH)
    ## SOME OTHER ENVIRONMENT SETUP STUFF

def _cleanUpEnvironment(self):
    '''Delete temp files created for this run'''
    try:
        for path, dirs, files in os.walk(self.WORK_DIR, topdown=False):
            for f in files:
                os.unlink(os.path.join(path, f))
            for d in dirs:
                os.rmdir(os.path.join(path, d))
        os.rmdir(self.WORK_DIR)
    except:
        print >>sys.stderr, 'Could not delete temp files; left at:'
        print >>sys.stderr, self.WORK_DIR

def finish(self):
    '''Clean up and finish the run (write out to the database)'''
    if self.SAVE_DB:
        self._db.commit()
    if self.CLEANUP:
        self._cleanUpEnvironment()

def __del__(self):
    ## FIXME: Known bug:
    ##  del t at command line works properly
    ##  Ctrl-D, when there is no db file present, results in a LookupError
    self.finish()

if __name__ == '__main__':
    printHelp()
    ## Provide tab completion to the user
    import readline, rlcompleter
    readline.parse_and_bind('tab: complete')
    t = OperatorClassName(os.path.splitext(__file__)[0])

In database:

def __init__(self, path):
    '''Constructor'''
    self.__dbpath = path  ## Path to the database
    self.load()

def load(self):
    '''Read the database out from the file'''
    from xml.parsers.expat import ExpatError
    if os.path.exists(self.__dbpath):
        ## Noticed exceptions: IOError, ExpatError
        try:
            self.root = ET.parse(self.__dbpath).getroot()
        except ExpatError:
            raise ExpatError('Invalid XML in ' + self.__dbpath)
    else:
        self.root = ET.Element('root')

def commit(self):
    '''Write the database back to the file'''
    ## Noticed exceptions: IOError
    ET.ElementTree(self.root).write(self.__dbpath)

--
~Matthew Strax-Haber
National Aeronautics and Space Administration
Langley Research Center (LaRC)
Co-op, Safety-Critical Avionics Systems Branch
W: 757-864-7378; C: 561-704-0029
Mail Stop 130
matthew.strax-ha...@nasa.gov

From: Martin Walsh mwa...@mwalsh.org
Date: Mon, 20 Apr 2009 16:05:01 -0500
To: Python Tutor tutor@python.org
Cc: Strax-Haber, Matthew (LARC-D320) matthew.strax-ha...@nasa.gov
Subject: Re: [Tutor] String Encoding problem

Forwarding to the list. Matt, perhaps you can repost in plain text, my mail client seems to have mangled your source ...

Strax-Haber, Matthew (LARC-D320) wrote:
> *From:* Martin Walsh mwa...@mwalsh.org
>
>> The environment available to __del__ methods during program termination is wonky, and apparently not very consistent either. I can't say that I completely understand it myself; perhaps someone else can provide a better explanation for both of us. Some of the causes are described in the documentation:
>>
>> http://docs.python.org/reference/datamodel.html#object.__del__
>>
>> What is your rationale for using __del__? Are you trying to force a 'commit()' call on Database instances when your program terminates -- in the case of an unhandled exception, for example?
>
> Perhaps I oversimplified a bit. In my actual code, there is a database class and an operator class.
> The actual structure is this:
>
> In operator:
>
> def __init__(self, path, saveDB=True, cleanUp=True):
>     '''Constructor'''
>     ## Calculate filesystem paths
>     self.WORK_DIR = path + '.tmp'
>     DB_PATH = path + '.xml'
>     self.SAVE_DB = saveDB   ## finish(): Delete unnecessary files created by run?
>     self.CLEANUP = cleanUp  ## finish(): Delete database at end of run?
>     ## Make sure we have a working directory (exception on failed write)
>     if not os.path.isdir(self.WORK_DIR):
>         os.mkdir(self.WORK_DIR)
>     self._db = DB.Database(DB_PATH)
>     ## SOME OTHER ENVIRONMENT SETUP STUFF
>
> def _cleanUpEnvironment(self):
>     try:
>         ## Delete temp files created for this run
>         for path, dirs, files in os.walk(self.WORK_DIR, topdown=False):
>             for f in files:
>                 os.unlink(os.path.join(path, f))
>             for d in dirs:
>                 os.rmdir(os.path.join(path, d))
>         os.rmdir(self.WORK_DIR)
>     except:
>         print >>sys.stderr, 'Could not delete temp files; left at:'
>         print >>sys.stderr, self.WORK_DIR
>
> def finish(self):
>     '''Clean up and finish the run (write out to the database)'''
>     if self.SAVE_DB
Re: [Tutor] String Encoding problem
I've solved the problem by leaving the decision of when to commit to client code. This isn't ideal, but it will do what is necessary, and unfortunately I don't have any more time to dedicate to this. I hate not being able to find a reasonable workaround :/.

--
~Matthew Strax-Haber

From: Kent Johnson ken...@tds.net
Date: Mon, 20 Apr 2009 16:55:16 -0500
To: Strax-Haber, Matthew (LARC-D320) matthew.strax-ha...@nasa.gov
Cc: Python Tutor tutor@python.org
Subject: Re: [Tutor] String Encoding problem

Can you give us a simple description of what you are trying to do? And if you can post in plain text instead of HTML, that would be helpful.

Maybe this will give you some ideas - you can trap the Control-D and do your cleanup:
http://openbookproject.net/pybiblio/tips/wilson/simpleExceptions.php

Kent
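Trapping Control-D and letting client code decide when to commit can be combined in one small loop. The sketch below assumes the program runs its own prompt rather than relying on the Python interactive interpreter, which is not how the code in this thread is driven. The names run_session, save_on_quit and read_line are hypothetical, and it uses the Python 3 input() builtin (raw_input() would be the Python 2 equivalent).

```python
def run_session(db, save_on_quit, read_line=input):
    """Collect lines from a prompt until Control-D, then commit if asked.

    db only needs a commit() method. read_line is injectable so the loop
    can be exercised without a terminal.
    """
    commands = []
    try:
        while True:
            try:
                commands.append(read_line('> '))
            except EOFError:
                # Control-D at the prompt: leave the loop normally.
                break
    finally:
        # Cleanup runs here, while the interpreter is still fully alive,
        # whether the loop ended by Control-D or by another exception.
        if save_on_quit:
            db.commit()
    return commands
```

The caller passes the save decision in explicitly, so nothing depends on when, or whether, __del__ is invoked.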