On Sat, Oct 9, 2010 at 7:59 PM, Brian Blais <bbl...@bryant.edu> wrote:
> This may be stemming from my complete ignorance of unicode, but when I do
> this (Python 2.6):
>
> s='\xc2\xa9 2008 \r\n'
>
> and I want the ascii version of it, ignoring any non-ascii chars, I thought I
> could do:
>
> s.encode('ascii','ignore')
>
> but it gives the error:
>
> In [20]:s.encode('ascii','ignore')
> ----------------------------------------------------------------------------
> UnicodeDecodeError                        Traceback (most recent call last)
>
> /Users/bblais/python/doit100810a.py in <module>()
> ----> 1
>       2
>       3
>       4
>       5
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0:
> ordinal not in range(128)
>
> am I doing something stupid here?
>
> of course, as a workaround, I can do: ''.join([c for c in s if ord(c)<128])
>
> but I thought the encode call should work.
>
> thanks,
> bb
>
Encode takes a Unicode string (made up of code points) and turns it
into a byte string (a sequence of bytes). In your case, you don't have
a Unicode string. You have a byte string. To re-encode that sequence of
bytes in a different encoding, you first have to figure out what those
bytes mean (decode it).

Python has no way of knowing that your string is UTF-8. In Python 2,
calling encode on a byte string makes Python decode it implicitly
first, using ascii as the default codec. That implicit decode is what
fails on byte 0xc2, which is why you get a UnicodeDecodeError from a
call to encode.

You can either decode the byte string explicitly or (if it's actually
a literal in your code) just specify it as a Unicode string:

    s = u'\u00a9 2008'
    s.encode('ascii','ignore')

The encode vs. decode confusion was removed in Python 3: byte strings
don't have an encode method and unicode strings don't have a decode
method.

> --
> Brian Blais
> bbl...@bryant.edu
> http://web.bryant.edu/~bblais
> http://bblais.blogspot.com/

--
http://mail.python.org/mailman/listinfo/python-list
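For the other route, decoding the byte string explicitly, here is a minimal sketch. It assumes the bytes really are UTF-8, which is a guess based on the \xc2\xa9 sequence, and to_ascii is just a name made up for illustration, not anything from the standard library. It is written so it should run on both Python 2.6+ and Python 3.

```python
def to_ascii(raw, source_encoding='utf-8'):
    """Drop non-ASCII characters from a byte string.

    Hypothetical helper: source_encoding is an assumption about what
    the bytes mean. Python cannot work that out on its own.
    """
    text = raw.decode(source_encoding)      # bytes -> Unicode code points
    return text.encode('ascii', 'ignore')   # Unicode -> ASCII bytes, skipping the rest


s = b'\xc2\xa9 2008 \r\n'   # the byte string from the original question
print(repr(to_ascii(s)))
```

The copyright sign is dropped and the remaining bytes come through unchanged. If the bytes were in some other encoding (latin-1, say), you would pass that as source_encoding instead.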