OK, I've cleaned up my code a bit, and it seems I've encoded/decoded myself into a corner ;-). My understanding of Unicode has room for improvement, that's for sure. I got some pointers, and the initial code cleanup seems to have removed some of the strange results I got, which several of you also pointed out.
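For anyone who lands here with the same symptom: the double conversion described in the quoted reply below can be reproduced and undone mechanically. What follows is only a minimal sketch, written in Python 3 syntax, where the Python 2 str objects from the thread correspond to bytes. The function name undo_extra_utf8 is made up for illustration, and the sketch assumes the data really went through the (decode latin1, encode utf8) step exactly one time too many.

```python
# Python 3 sketch: the thread's Python 2 str values are bytes objects here.

orig = b'f\xf8dselsdag'                      # latin1 bytes for the word
once = orig.decode('latin1').encode('utf8')  # correct UTF-8 bytes
mess = once.decode('latin1').encode('utf8')  # same step applied a second time


def undo_extra_utf8(data):
    """Reverse one surplus (decode latin1, encode utf8) round trip.

    Hypothetical helper: it is only valid if the input really was
    converted one time too many.
    """
    return data.decode('utf8').encode('latin1')


repaired = undo_extra_utf8(mess)   # back to correct UTF-8 bytes
text = repaired.decode('utf8')     # the actual text

# The failure from the thread: decoding the bytes as ASCII cannot work,
# because 0xc3 is outside range(128).
try:
    mess.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

This only repairs data that is already damaged. The real fix is still to find the place in the code where the second conversion happens.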
Anyway, thanks for all your replies. I think I can get this thing up and running with a bit more code tinkering. And I'll read up on some Unicode docs as well. :-)

Thanks again.

Thomas

John Machin wrote:
> Thomas W wrote:
> > I'm getting really annoyed with Python in regards to
> > unicode/ascii-encoding problems.
> >
> > The string below is the encoding of the Norwegian word "fødselsdag".
> >
> > >>> s = 'f\xc3\x83\xc2\xb8dselsdag'
>
> There is no such thing as "*the* encoding" of any given string.
>
> >
> > I stored the string as "fødselsdag" but somewhere in my code it got
> > translated into the mess above and I cannot get the original string
> > back.
>
> Somewhere in your code??? Can't you track through your code to see
> where it is being changed? Failing that, can't you show us your code so
> that we can help you?
>
> I have guessed *what* you got, but *how* you got it boggles the mind:
>
> The effect is the same as (decode from latin1 to Unicode, encode as
> utf8) *TWICE*. That's how you change one byte in the original to *FOUR*
> bytes in the "mess":
>
> | >>> orig = 'f\xf8dselsdag'
> | >>> orig.decode('latin1').encode('utf8')
> | 'f\xc3\xb8dselsdag'
> | >>> orig.decode('latin1').encode('utf8').decode('latin1').encode('utf8')
> | 'f\xc3\x83\xc2\xb8dselsdag'
> | >>>
>
> > It cannot be printed in the console or written to a plain text file.
>
> Incorrect. *Any* string can be printed on the console or written to a
> file. What you mean is that when you look at the output, it is not what
> you want.
>
> > I've tried to convert it using
> >
> > >>> s.encode('iso-8859-1')
> > Traceback (most recent call last):
> >   File "<interactive input>", line 1, in ?
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> > ordinal not in range(128)
>
> encode is an attribute of unicode objects. If applied to a str object,
> the str object is converted to unicode first using the default codec
> (typically ascii).
>
> s.encode('iso-8859-1') is effectively
> s.decode('ascii').encode('iso-8859-1'), and s.decode('ascii') fails for
> the (obvious(?)) reason given.
>
> >
> > >>> s.encode('utf-8')
> > Traceback (most recent call last):
> >   File "<interactive input>", line 1, in ?
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
> > ordinal not in range(128)
>
> Same story as for 'iso-8859-1'
>
> >
> > And nothing helps. I cannot remember having these problems in earlier
> > versions of Python
>
> I would be very surprised if you couldn't reproduce your problem on any
> 2.n version of Python.
>
> > and it's really annoying, even if it's my own fault
> > somehow, handling of normal characters like this shouldn't cause this
> > much hassle. Searching Google for "codec can't decode byte" and
> > UnicodeDecodeError etc. produces a bunch of hits so it's obvious I'm
> > not alone.
> >
> > Any hints?
>
> 1. Read the Unicode howto: http://www.amk.ca/python/howto/unicode
> 2. Read the Python documentation on .decode() and .encode() carefully.
> 3. Show us your code so that we can help you avoid the double
> conversion to utf8. Tell us what IDE you are using.
> 4. Tell us what you are trying to achieve. Note that if all you are
> trying to do is read and write text in Norwegian (or any other language
> that's representable in iso-8859-1 aka latin1), then you don't have to
> do anything special at all in your code; this is the good old "legacy"
> way of doing things universally in vogue before Unicode was invented!
>
> HTH,
> John

--
http://mail.python.org/mailman/listinfo/python-list