On Sat, Oct 9, 2010 at 4:59 PM, Brian Blais <bbl...@bryant.edu> wrote:
> This may be a stemming from my complete ignorance of unicode, but when I do
> this (Python 2.6):
>
> s='\xc2\xa9 2008 \r\n'
>
> and I want the ascii version of it, ignoring any non-ascii chars, I thought I
> could do:
>
> s.encode('ascii','ignore')
>
> but it gives the error:
>
> In [20]:s.encode('ascii','ignore')
> ----------------------------------------------------------------------------
> UnicodeDecodeError                        Traceback (most recent call last)
>
> /Users/bblais/python/doit100810a.py in <module>()
> ----> 1
>       2
>       3
>       4
>       5
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0:
> ordinal not in range(128)
>
> am I doing something stupid here?

In addition to Benjamin's explanation:

Unicode strings in Python are of type `unicode` and written with a leading "u"; e.g. u"A unicode string for ¥500". Byte strings lack the leading "u"; e.g. "A plain byte string". Note that "Unicode string" does not refer to strings which have been encoded using a Unicode encoding (e.g. UTF-8); such strings are still byte strings, because encodings emit bytes.

As to why you got the /exact/ error you did: calling encode() on a byte string is nonsensical, since a byte string is already encoded. As a backward compatibility hack, in order to satisfy the request anyway, Python implicitly tried to decode the byte string `s` using ASCII as a default (the choice of ASCII here has nothing to do with the fact that you specified ASCII in your encoding request), so that it could then try to encode the resulting unicode string. That is why you got a Unicode*De*codeError as opposed to a Unicode*En*codeError, despite the fact that you called *en*code().

Highly suggested further reading:
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
http://www.joelonsoftware.com/articles/Unicode.html

Cheers,
Chris
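
P.S. To make the decode-then-encode order concrete, here is a minimal sketch. It assumes the bytes in `s` are UTF-8 (the post does not say so, but 0xC2 0xA9 is the UTF-8 encoding of the copyright sign); the names `u` and `ascii_only` are just mine for illustration. The b'' prefix is used so that the same lines run on Python 2.6+ and on Python 3.

```python
# Sketch only: assumes `s` holds UTF-8 encoded bytes (an assumption,
# not something stated in the original post).
s = b'\xc2\xa9 2008 \r\n'   # a byte string

# Step 1: decode explicitly, turning bytes into a unicode string.
u = s.decode('utf-8')

# Step 2: encode to ASCII, dropping characters that ASCII cannot represent.
ascii_only = u.encode('ascii', 'ignore')

print(repr(ascii_only))
```

With that input, `ascii_only` should compare equal to b' 2008 \r\n': the copyright sign is dropped and everything else survives. If you do not know the source encoding, s.decode('ascii', 'ignore') is a blunter option that simply discards every non-ASCII byte and leaves you with a unicode string.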