Thomas W wrote:
> I'm getting really annoyed with Python in regards to
> unicode/ascii-encoding problems.
>
> The string below is the encoding of the Norwegian word "fødselsdag".
>
> >>> s = 'f\xc3\x83\xc2\xb8dselsdag'
>
> I stored the string as "fødselsdag" but somewhere in my code it got
> translated into the mess above and I cannot get the original string
> back. It cannot be printed in the console or written to a plain text file.
> I've tried to convert it using
>
> >>> s.encode('iso-8859-1')
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> ordinal not in range(128)
>
> >>> s.encode('utf-8')
> Traceback (most recent call last):
>   File "<interactive input>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> ordinal not in range(128)
>
> And nothing helps. I cannot remember having these problems in earlier
> versions of Python and it's really annoying, even if it's my own fault
> somehow, handling of normal characters like this shouldn't cause this
> much hassle. Searching Google for "codec can't decode byte" and
> UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
> not alone.

You want .decode() (which converts a byte string into a Unicode string), not .encode() (which converts a Unicode string into a byte string). You get a UnicodeDecodeError even though you are calling .encode() because whenever Python expects a Unicode string but gets a byte string, it first tries to decode the byte string as 7-bit ASCII. If that fails, it raises a UnicodeDecodeError.

However, I don't know of a single encoding that takes u"fødselsdag" to 'f\xc3\x83\xc2\xb8dselsdag'.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco
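
To make the decode/encode distinction concrete, here is a minimal sketch. It assumes Python 3, where byte strings (bytes) and Unicode strings (str) are separate types and the implicit ASCII decode described above never happens. The repair step rests on a guess that is not stated anywhere in the thread: that the text was UTF-8 encoded twice, with the bytes misread as ISO-8859-1 in between. The variable names are only illustrative.

```python
# Assumes Python 3: bytes and str are distinct and never mixed implicitly.
raw = b'f\xc3\x83\xc2\xb8dselsdag'  # the byte string from the original post

# .decode() converts a byte string into a Unicode string.
once = raw.decode('utf-8')  # still garbled text, but now a real str

# Assumption: the data was UTF-8 encoded twice with an ISO-8859-1
# misreading in between. Undo the misreading, then decode once more.
repaired = once.encode('iso-8859-1').decode('utf-8')

# ascii() shows non-ASCII characters as escapes, so the result can be
# inspected even on a console that cannot display them.
print(ascii(once))
print(ascii(repaired))
```

If the guess is right, repaired holds the ten-character word with a single U+00F8 in second position. If the data was mangled some other way, the intermediate .encode('iso-8859-1') or the final .decode('utf-8') may raise an error or produce different text.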