[Thomas Thomas]
| how can I find the encoding to use to open a file.. I have a
| file with "£" character..
| is there some utility function in python that I can use
|
| how can I know which encoding to use

[This is going to be a longer answer than you really want. The short answer is "probably iso-8859-1 but there's no way of being certain without trying it out".]

The general answer to "how can I know which encoding to use for an arbitrary text file?" is: you can't. The more helpful answer is that there are various heuristics (polite term for "good guessing algorithms") which will help you out. I believe that the latest BeautifulSoup has one:

http://www.crummy.com/software/BeautifulSoup/

and I'm sure there are others. To be certain, though, you need to be told -- somehow -- what encoding was in use when the file was saved.

However, that's not quite what you're asking. You say you have a file with a "£" character. But what does that mean? Ultimately, that you have some text in a file, one character of which you expect to display as a pound sign (that's a British pound sign, not the # which Americans bizarrely call a pound sign ;).

Someone, somewhere, got this pound sign into a file. Maybe it was from a text editor, maybe through a database. However it happened, the application saved its data to disk using some encoding. If it was a naive tool (non-unicode-aware) then it was probably ASCII with some kind of extension above the 7-bit mark. iso-8859-1 / latin-1 (same thing) often copes with that. If the app was unicode-aware, it'll be a specific unicode encoding. Quite possibly utf-8.

To experiment, pick the necessary byte/bytes out of your text stream and compare with a few encodings:

<dump>
import sys
from unicodedata import name

#
# This is, for example, your original "pound sign"
#
bytes = "\x9c"

#
# This is what we're aiming for: what unicode
# thinks of as a pound sign
#
print name (u"£")
# -> POUND SIGN

#
# Let's try ascii
#
print name (bytes.decode ("ascii"))
#
# Whoops!
# -> UnicodeDecodeError: 'ascii' codec can't decode byte 0x9c in position 0: ordinal not in range(128)
#

#
# iso-8859-1 / latin-1
#
print name (bytes.decode ("iso-8859-1"))
#
# Still not right
# -> ValueError: no such name
#

#
# Cheating, slightly...
#
print name (bytes.decode (sys.stdin.encoding))
#
# Bingo!
# -> POUND SIGN

print sys.stdin.encoding
# -> cp437
print sys.stdout.encoding
# -> cp437
</dump>

So in this case it was cp437 (since I got the bytes from typing "£" into the interpreter, something I can do on my keyboard). You might well find it was some other encoding.

If this doesn't take you anywhere -- or you don't understand it -- try dumping a bit of your data into an email and posting it. If nothing else, someone will probably be able to tell you what encoding you need!

TJG
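
PS The experiment in the dump above can also be run as a loop over several candidate encodings. What follows is only a sketch, and it is written for Python 3 where the dump is Python 2. The function name candidate_encodings and the particular list of encodings are my own choices for illustration, not anything standard; add whichever encodings you suspect.

```python
from unicodedata import name

# Hypothetical helper: the name and the default list of encodings
# are assumptions made for this illustration.
def candidate_encodings(raw, wanted="POUND SIGN",
                        encodings=("ascii", "utf-8", "iso-8859-1",
                                   "cp1252", "cp437", "cp850")):
    """Return the encodings under which `raw` decodes to exactly one
    character whose Unicode name is `wanted`."""
    matches = []
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding at all
            continue
        # name() takes a default, so unnamed characters (e.g. control
        # codes) give "" instead of raising ValueError
        if len(text) == 1 and name(text, "") == wanted:
            matches.append(encoding)
    return matches

# The byte from the experiment above
print(candidate_encodings(b"\x9c"))

# What a latin-1 or utf-8 aware application would have written instead
print(candidate_encodings(b"\xa3"))
print(candidate_encodings(b"\xc2\xa3"))
```

More than one encoding can come back for the same byte, which is exactly why a single character is rarely enough to be certain: the more of the file you can check against text you know should be there, the fewer candidates survive.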