Re: Unicode/ascii encoding nightmare
John Machin wrote:
> Indeed yourself.

What does the above mean?

> Have you ever considered reading posts in chronological order, or reading all posts in a thread?

I do not think people read posts in chronological order; it simply doesn't make sense. I also don't think many read threads completely, but only until the issue is clear or boredom kicks in. Your nice double whammy post was enough to clarify what happened to the OP; I just wanted to make a bit more explicit what you meant. My poor English also made me understand that you were just suspecting such an error, so I verified it and posted the result. That your suspicion was a sarcastic remark could be clear only when reading the earlier reply, which however happened to be lower in the thread tree in my newsreader, a fact that pushed it into the "not worth reading" area.

> It might help you avoid writing posts with non-zero information content.

Why should I *avoid* writing posts with *non-zero* information content? Double whammy on negation, or still my poor English kicking in? :-)

Suppose you didn't post the double whammy message, and suppose someone else made it seven minutes later than your other post. I suppose that in this case the message would be zero-content noise (and not the precious pearl of wisdom it is because it comes from you).

> Cheers, John

Andrea
--
http://mail.python.org/mailman/listinfo/python-list
Re: Unicode/ascii encoding nightmare
Thomas W wrote:
> Ok, I've cleaned up my code a bit and it seems as if I've encoded/decoded myself into a corner ;-).

Yes, you may encounter situations where you have some string, you decode it (i.e. convert it to Unicode) using one character encoding, but then you later encode it (i.e. convert it back to a plain string) using a different character encoding. This isn't a problem on its own, but if you then take that plain string and attempt to convert it to Unicode again, using the same input encoding as before, you'll be misinterpreting the contents of the string.

This round tripping of character data is typical of Web applications: you emit a Web page in one encoding, the fields in the forms are represented in that encoding, and upon form submission you receive this data. If you then process the form data using a different encoding, you're misinterpreting what you previously emitted, and when you emit this data again, you compound the error.

> My understanding of unicode has room for improvement, that's for sure. I got some pointers and initial code-cleanup seems to have removed some of the strange results I got, which several of you also pointed out.

Converting to Unicode for processing is a best practice that you seem to have adopted, but it's vital that you use character encodings consistently. One trick, which can be used to mitigate situations where you have less control over the encoding of data given to you, is to attempt to convert to Unicode using an encoding that is conservative with regard to acceptable combinations of byte sequences, such as UTF-8; if such a conversion fails, it's quite possible that another encoding applies, such as ISO-8859-1, and you can try that. Since ISO-8859-1 is a liberal encoding, in the sense that any byte value or combination of byte values is acceptable, it should only be used as a last resort.

However, it's best to have a high level of control over character encodings rather than using tricks to avoid considering representation issues carefully.

Paul
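The fallback trick described above might be sketched as follows. This is written for Python 3, where byte strings are `bytes` and Unicode strings are `str` (the thread itself uses Python 2), and the helper name `decode_with_fallback` is made up for illustration, not something from the thread or the standard library.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding_used) for the first encoding that decodes raw.

    Strict encodings such as UTF-8 go first. ISO-8859-1 accepts every byte
    value, so it never fails and belongs at the end as a last resort.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # this encoding rejected the bytes; try the next one
    raise ValueError("none of the candidate encodings matched")


utf8_bytes = "f\xf8dselsdag".encode("utf-8")         # b'f\xc3\xb8dselsdag'
latin1_bytes = "f\xf8dselsdag".encode("iso-8859-1")  # b'f\xf8dselsdag'

text_a, used_a = decode_with_fallback(utf8_bytes)
text_b, used_b = decode_with_fallback(latin1_bytes)
print(used_a)            # utf-8
print(used_b)            # iso-8859-1
print(text_a == text_b)  # True
```

The order of the candidates matters: if ISO-8859-1 came first, UTF-8 input would be accepted silently and misread, which is how the thread's double encoding starts.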
Re: Unicode/ascii encoding nightmare
On Tue, 2006-11-07 at 08:10 +0200, Hendrik van Rooyen wrote:
> John Machin wrote:
>
> 8<---
>
> > I strongly suggest that you read the docs *FIRST*, and don't tinker at all.
>
> This is *good* advice - it's unlikely to be followed though, as the OP is prolly just like most of us - you unpack the stuff out of the box and start assembling it, and only towards the end, when it won't fit together, do you read the manual to see where you went wrong...

I fall right into this camp(fire). I'm always amazed and awed at people who actually read the docs *thoroughly* before starting. I know some people do but frankly, unless it's a step-by-step tutorial, I rarely read the docs beyond getting a basic understanding of what something does before I start tinkering.

I've always been a firm believer in the Chinese proverb:

    I hear and I forget
    I see and I remember
    I do and I understand

Of course, I usually just skip straight to the third step and try to work backwards as needed. This usually works pretty well but when it doesn't it fails horribly. Unfortunately (for me), working from step one rarely works at all, so that's the boat I'm stuck in.

I've always been a bit miffed at the RTFM crowd (and somewhat jealous, I admit). I *do* RTFM, but as often as not the fine manual confuses as much as clarifies. I'm not convinced this is the result of poor documentation so much as that I personally have a different mental approach to problem-solving than the others who find documentation universally enlightening. I also suspect that I'm not alone in my approach and that the RTFM crowd is more than a little close-minded about how others might think about and approach solving problems and understanding concepts.

Also, much documentation (including the Python docs) tends to be reference-manual style. This is great if you *already* understand the problem and just need details, but does about as much for *understanding* as a dictionary does for learning a language. When I'm perusing the Python reference manual, I usually find that 10 lines of example code are worth 1000 lines of function descriptions and cross-references.

Just my $0.02.

Regards,
Cliff
Re: Unicode/ascii encoding nightmare
On Mon, 2006-11-06 at 15:47 -0800, John Machin wrote:
> Gabriel Genellina wrote:
> > At Monday 6/11/2006 20:34, Robert Kern wrote:
> > > John Machin wrote:
> > > > Indeed yourself. Have you ever considered reading posts in chronological order, or reading all posts in a thread?
> > >
> > > That presumes that messages arrive in chronological order and transmissions are instantaneous. Neither are true.
> >
> > Sometimes I even got the replies *before* the original post comes.
>
> What is in question is the likelihood that message B can appear before message A, when both emanate from the same source, and B was sent about 7 minutes after A.

Usenet, email, usenet/email gateways, internet in general... all in all, pretty likely. I've often seen replies to my posts long before my own post shows up. In fact, I've seen posts not show up for several hours.

Regards,
Cliff
Unicode/ascii encoding nightmare
I'm getting really annoyed with Python in regards to unicode/ascii-encoding problems.

The string below is the encoding of the Norwegian word fødselsdag.

    >>> s = 'f\xc3\x83\xc2\xb8dselsdag'

I stored the string as fødselsdag but somewhere in my code it got translated into the mess above and I cannot get the original string back. It cannot be printed in the console or written to a plain text-file. I've tried to convert it using

    >>> s.encode('iso-8859-1')
    Traceback (most recent call last):
      File "<interactive input>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

    >>> s.encode('utf-8')
    Traceback (most recent call last):
      File "<interactive input>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

And nothing helps. I cannot remember having these problems in earlier versions of Python, and it's really annoying; even if it's my own fault somehow, handling of normal characters like this shouldn't cause this much hassle. Searching Google for "codec can't decode byte" and "UnicodeDecodeError" etc. produces a bunch of hits, so it's obvious I'm not alone.

Any hints?
Re: Unicode/ascii encoding nightmare
> The string below is the encoding of the Norwegian word fødselsdag.
>
>     >>> s = 'f\xc3\x83\xc2\xb8dselsdag'

I'm not sure which encoding method you used to get the string above. Here's the result of my playing with the string in IDLE:

    >>> u1 = u'fødselsdag'
    >>> u1
    u'f\xf8dselsdag'
    >>> s1 = u1.encode('utf-8')
    >>> s1
    'f\xc3\xb8dselsdag'
    >>> u2 = s1.decode('utf-8')
    >>> u2
    u'f\xf8dselsdag'
    >>> print u2
    fødselsdag
Re: Unicode/ascii encoding nightmare
Thomas W wrote:
> I'm getting really annoyed with Python in regards to unicode/ascii-encoding problems.
>
> The string below is the encoding of the Norwegian word fødselsdag.
>
>     >>> s = 'f\xc3\x83\xc2\xb8dselsdag'
>
> I stored the string as fødselsdag but somewhere in my code it got translated into the mess above and I cannot get the original string back. It cannot be printed in the console or written to a plain text-file. I've tried to convert it using
>
>     >>> s.encode('iso-8859-1')
>     Traceback (most recent call last):
>       File "<interactive input>", line 1, in ?
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>
>     >>> s.encode('utf-8')
>     Traceback (most recent call last):
>       File "<interactive input>", line 1, in ?
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>
> And nothing helps. I cannot remember having these problems in earlier versions of Python, and it's really annoying; even if it's my own fault somehow, handling of normal characters like this shouldn't cause this much hassle. Searching Google for "codec can't decode byte" and "UnicodeDecodeError" etc. produces a bunch of hits, so it's obvious I'm not alone.

You would want .decode() (which converts a byte string into a Unicode string), not .encode() (which converts a Unicode string into a byte string). You get UnicodeDecodeErrors even though you are trying to .encode() because whenever Python is expecting a Unicode string but gets a byte string, it tries to decode the byte string as 7-bit ASCII. If that fails, then it raises a UnicodeDecodeError.

However, I don't know of an encoding that takes u'fødselsdag' to 'f\xc3\x83\xc2\xb8dselsdag'.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco
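To make the hidden step visible, here is a small Python 3 sketch (an assumption on my part, since the thread uses Python 2). In Python 3 the two types are split, so the implicit ASCII decode that Python 2 performed before `.encode()` has to be written out by hand; the variable names are illustrative only.

```python
# Python 3 sketch: bytes can only be decoded and str can only be encoded,
# so the implicit ASCII decode from Python 2 must be spelled out.
s = b"f\xc3\x83\xc2\xb8dselsdag"  # the byte string from the original post

bytes_can_encode = hasattr(s, "encode")      # bytes objects have no .encode()
text_can_decode = hasattr("text", "decode")  # str objects have no .decode()

try:
    s.decode("ascii")  # the step Python 2 ran silently before encoding
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(bytes_can_encode, text_can_decode)
print(error_message)
```

The message printed at the end should be the same one quoted in the tracebacks above, which is why a call to .encode() ended up reporting a *decode* error.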
Re: Unicode/ascii encoding nightmare
Thomas W wrote:
> I'm getting really annoyed with Python in regards to unicode/ascii-encoding problems.
>
> The string below is the encoding of the Norwegian word fødselsdag.
>
>     >>> s = 'f\xc3\x83\xc2\xb8dselsdag'

There is no such thing as *the* encoding of any given string.

> I stored the string as fødselsdag but somewhere in my code it got translated into the mess above and I cannot get the original string back.

Somewhere in your code??? Can't you track through your code to see where it is being changed? Failing that, can't you show us your code so that we can help you?

I have guessed *what* you got, but *how* you got it boggles the mind:

The effect is the same as (decode from latin1 to Unicode, encode as utf8) *TWICE*. That's how you change one byte in the original to *FOUR* bytes in the mess:

| >>> orig = 'f\xf8dselsdag'
| >>> orig.decode('latin1').encode('utf8')
| 'f\xc3\xb8dselsdag'
| >>> orig.decode('latin1').encode('utf8').decode('latin1').encode('utf8')
| 'f\xc3\x83\xc2\xb8dselsdag'
| >>>

> It cannot be printed in the console or written to a plain text-file.

Incorrect. *Any* string can be printed on the console or written to a file. What you mean is that when you look at the output, it is not what you want.

> I've tried to convert it using
>
>     >>> s.encode('iso-8859-1')
>     Traceback (most recent call last):
>       File "<interactive input>", line 1, in ?
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

encode is an attribute of unicode objects. If applied to a str object, the str object is converted to unicode first using the default codec (typically ascii). s.encode('iso-8859-1') is effectively s.decode('ascii').encode('iso-8859-1'), and s.decode('ascii') fails for the (obvious(?)) reason given.

>     >>> s.encode('utf-8')
>     Traceback (most recent call last):
>       File "<interactive input>", line 1, in ?
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Same story as for 'iso-8859-1'.

> And nothing helps. I cannot remember having these problems in earlier versions of Python

I would be very surprised if you couldn't reproduce your problem on any 2.n version of Python.

> and it's really annoying; even if it's my own fault somehow, handling of normal characters like this shouldn't cause this much hassle. Searching Google for "codec can't decode byte" and "UnicodeDecodeError" etc. produces a bunch of hits, so it's obvious I'm not alone.
>
> Any hints?

1. Read the Unicode howto: http://www.amk.ca/python/howto/unicode
2. Read the Python documentation on .decode() and .encode() carefully.
3. Show us your code so that we can help you avoid the double conversion to utf8. Tell us what IDE you are using.
4. Tell us what you are trying to achieve.

Note that if all you are trying to do is read and write text in Norwegian (or any other language that's representable in iso-8859-1 aka latin1), then you don't have to do anything special at all in your code -- this is the good old legacy way of doing things universally in vogue before Unicode was invented!

HTH,
John
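The double conversion described above, and one way to undo it, can be sketched in Python 3 (a translation, since the thread's session is Python 2). The repair step is not something proposed in the thread; it simply runs the same two conversions backwards, and it only works when the damage really was a UTF-8/latin1 double encoding.

```python
original = "f\xf8dselsdag"  # the intended text, with U+00F8 for the slashed o

# Correct: encode to UTF-8 once.
once = original.encode("utf-8")

# The mistake: treat those UTF-8 bytes as latin1 text and encode again.
twice = once.decode("latin-1").encode("utf-8")

# Repair sketch: undo the second encoding, then decode the real UTF-8.
repaired = twice.decode("utf-8").encode("latin-1").decode("utf-8")

print(once)                  # b'f\xc3\xb8dselsdag'
print(twice)                 # b'f\xc3\x83\xc2\xb8dselsdag'
print(repaired == original)  # True
```

Each non-ASCII character grows from one latin1 byte to two UTF-8 bytes, then to four after the second pass, which matches the byte counts in the message above.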
Re: Unicode/ascii encoding nightmare
Robert Kern wrote:
> However, I don't know of an encoding that takes u'fødselsdag' to 'f\xc3\x83\xc2\xb8dselsdag'.

There isn't one. C3 and C2 hint at UTF-8. The fact that C3 and C2 are both present, plus the fact that one non-ASCII byte has morphoploded into 4 bytes indicate a double whammy.

Cheers,
John
Re: Unicode/ascii encoding nightmare
John Machin wrote:
> The fact that C3 and C2 are both present, plus the fact that one non-ASCII byte has morphoploded into 4 bytes indicate a double whammy.

Indeed...

    >>> x = u'fødselsdag'
    >>> x.encode('utf-8').decode('iso-8859-1').encode('utf-8')
    'f\xc3\x83\xc2\xb8dselsdag'

Andrea
Re: Unicode/ascii encoding nightmare
Thomas W wrote:
> I'm getting really annoyed with Python in regards to unicode/ascii-encoding problems.
>
> The string below is the encoding of the Norwegian word fødselsdag.
>
>     >>> s = 'f\xc3\x83\xc2\xb8dselsdag'

Which encoding is this?

> I stored the string as fødselsdag but somewhere in my code it got

You stored it where?

> translated into the mess above and I cannot get the original string back. It cannot be printed in the console or written to a plain text-file. I've tried to convert it using
>
>     >>> s.encode('iso-8859-1')
>     Traceback (most recent call last):
>       File "<interactive input>", line 1, in ?
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Note that encode on a string object is often an indication of an error. The encoding direction (for normal encodings, not special things like the zlib codec) is as follows:

    encode: from Unicode
    decode: to Unicode

(The encode method of strings first DEcodes the string with the default encoding, which is normally ascii, then ENcodes it with the given encoding.)

>     >>> s.encode('utf-8')
>     Traceback (most recent call last):
>       File "<interactive input>", line 1, in ?
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>
> And nothing helps. I cannot remember having these problems in earlier versions of Python, and it's really annoying; even if it's my own fault somehow, handling of normal characters like this shouldn't cause this much hassle. Searching Google for "codec can't decode byte" and "UnicodeDecodeError" etc. produces a bunch of hits, so it's obvious I'm not alone.

Unicode causes many problems if not used properly. If you want to use Unicode strings, use them everywhere in your Python application, decode input as early as possible, and encode output only before writing it to a file or another program.

Georg
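A rough Python 3 sketch of the "decode early, encode late" rule follows. The helper names `read_text` and `write_text`, the in-memory streams, and the choice of latin1 input with UTF-8 output are all assumptions made for illustration; none of them come from the thread.

```python
import io


def read_text(stream, encoding):
    """Decode at the input boundary: bytes in, text out."""
    return stream.read().decode(encoding)


def write_text(stream, text, encoding):
    """Encode at the output boundary: text in, bytes out."""
    stream.write(text.encode(encoding))


# Pretend the input arrives as latin1 bytes (an assumption for this sketch).
incoming = io.BytesIO(b"f\xf8dselsdag")
text = read_text(incoming, "iso-8859-1")

# All processing happens on text, never on encoded bytes.
shouted = text.upper()

outgoing = io.BytesIO()
write_text(outgoing, shouted, "utf-8")
print(outgoing.getvalue())  # b'F\xc3\x98DSELSDAG'
```

Because each encoding is named exactly once, at the edge where bytes enter or leave, there is no place in the middle of the program for a second, accidental conversion.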
Re: Unicode/ascii encoding nightmare
Ok, I've cleaned up my code a bit and it seems as if I've encoded/decoded myself into a corner ;-). My understanding of unicode has room for improvement, that's for sure. I got some pointers and initial code-cleanup seems to have removed some of the strange results I got, which several of you also pointed out.

Anyway, thanks for all your replies. I think I can get this thing up and running with a bit more code tinkering. And I'll read up on some unicode-docs as well. :-)

Thanks again.

Thomas
Re: Unicode/ascii encoding nightmare
Thomas W wrote:
> Ok, I've cleaned up my code a bit and it seems as if I've encoded/decoded myself into a corner ;-). My understanding of unicode has room for improvement, that's for sure. I got some pointers and initial code-cleanup seems to have removed some of the strange results I got, which several of you also pointed out.
>
> Anyway, thanks for all your replies. I think I can get this thing up and running with a bit more code tinkering. And I'll read up on some unicode-docs as well. :-)
>
> Thanks again.

I strongly suggest that you read the docs *FIRST*, and don't tinker at all.

HTH,
John
Re: Unicode/ascii encoding nightmare
Andrea Griffini wrote:
> John Machin wrote:
> > The fact that C3 and C2 are both present, plus the fact that one non-ASCII byte has morphoploded into 4 bytes indicate a double whammy.
>
> Indeed...
>
>     >>> x = u'fødselsdag'
>     >>> x.encode('utf-8').decode('iso-8859-1').encode('utf-8')
>     'f\xc3\x83\xc2\xb8dselsdag'

Indeed yourself.

Have you ever considered reading posts in chronological order, or reading all posts in a thread?

It might help you avoid writing posts with non-zero information content.

Cheers,
John
Re: Unicode/ascii encoding nightmare
John Machin wrote:
> Indeed yourself.
>
> Have you ever considered reading posts in chronological order, or reading all posts in a thread?

That presumes that messages arrive in chronological order and transmissions are instantaneous. Neither are true.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco
Re: Unicode/ascii encoding nightmare
At Monday 6/11/2006 20:34, Robert Kern wrote:
> John Machin wrote:
> > Indeed yourself.
> >
> > Have you ever considered reading posts in chronological order, or reading all posts in a thread?
>
> That presumes that messages arrive in chronological order and transmissions are instantaneous. Neither are true.

Sometimes I even got the replies *before* the original post comes.

--
Gabriel Genellina
Softlab SRL
Re: Unicode/ascii encoding nightmare
Gabriel Genellina wrote:
> At Monday 6/11/2006 20:34, Robert Kern wrote:
> > John Machin wrote:
> > > Indeed yourself.
> > >
> > > Have you ever considered reading posts in chronological order, or reading all posts in a thread?
> >
> > That presumes that messages arrive in chronological order and transmissions are instantaneous. Neither are true.
>
> Sometimes I even got the replies *before* the original post comes.

What is in question is the likelihood that message B can appear before message A, when both emanate from the same source, and B was sent about 7 minutes after A.
Re: Unicode/ascii encoding nightmare
John Machin wrote:
> Thomas W wrote:
> > Ok, I've cleaned up my code a bit and it seems as if I've encoded/decoded myself into a corner ;-). My understanding of unicode has room for improvement, that's for sure. I got some pointers and initial code-cleanup seems to have removed some of the strange results I got, which several of you also pointed out.
> >
> > Anyway, thanks for all your replies. I think I can get this thing up and running with a bit more code tinkering. And I'll read up on some unicode-docs as well. :-)
> >
> > Thanks again.
>
> I strongly suggest that you read the docs *FIRST*, and don't tinker at all.
>
> . . .

Does <URL: http://www.unixreview.com/documents/s=10102/ur0611d/ur0610d.htm > help?
Re: Unicode/ascii encoding nightmare
Cameron Laird wrote:
> John Machin wrote:
> > Thomas W wrote:
> > > Ok, I've cleaned up my code a bit and it seems as if I've encoded/decoded myself into a corner ;-). My understanding of unicode has room for improvement, that's for sure. I got some pointers and initial code-cleanup seems to have removed some of the strange results I got, which several of you also pointed out.
> > >
> > > Anyway, thanks for all your replies. I think I can get this thing up and running with a bit more code tinkering. And I'll read up on some unicode-docs as well. :-)
> > >
> > > Thanks again.
> >
> > I strongly suggest that you read the docs *FIRST*, and don't tinker at all.
> >
> > . . .
>
> Does <URL: http://www.unixreview.com/documents/s=10102/ur0611d/ur0610d.htm > help?

Hi Cameron,

Yes. Sorry for short reply -- gotta run to bookshop :-)

Cheers,
John
Re: Unicode/ascii encoding nightmare
John Machin wrote:

8<---

> I strongly suggest that you read the docs *FIRST*, and don't tinker at all.
>
> HTH,
> John

This is *good* advice - it's unlikely to be followed though, as the OP is prolly just like most of us - you unpack the stuff out of the box and start assembling it, and only towards the end, when it won't fit together, do you read the manual to see where you went wrong...

- Hendrik