Re: What encoding does u'...' syntax use?
In article 499f397c.7030...@v.loewis.de, Martin v. Löwis mar...@v.loewis.de wrote:
>> Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it parses the ascii string u'\xb5'
>>
>> My question is: what is that encoding?
>
> The internal representation is either UTF-16, or UTF-32; which one is a compile-time choice (i.e. when the Python interpreter is built).

Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the countless threads about the distinction between UTF and UCS?
--
Aahz (a...@pythoncraft.com) * http://www.pythoncraft.com/
Weinberg's Second Law: If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.
-- http://mail.python.org/mailman/listinfo/python-list
Re: What encoding does u'...' syntax use?
* Martin v. Löwis (Sat, 21 Feb 2009 00:15:08 +0100)
>> Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it parses the ascii string u'\xb5'
>>
>> My question is: what is that encoding?
>
> The internal representation is either UTF-16, or UTF-32; which one is a compile-time choice (i.e. when the Python interpreter is built).

I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a slight difference to UTF-16/UTF-32).

Thorsten
Re: What encoding does u'...' syntax use?
On Sat, Feb 21, 2009 at 7:24 PM, Thorsten Kampe thors...@thorstenkampe.de wrote:
> I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a slight difference to UTF-16/UTF-32).

I wouldn't call the difference that slight, especially between UTF-16 and UCS-2, since the former can encode all Unicode code points, while the latter can only encode those in the BMP.
--
Denis Kasak
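A minimal sketch of the UTF-16 versus UCS-2 distinction described above. It is written for Python 3 (an assumption; the thread concerns Python 2), and the helper name utf16_code_units is made up for this illustration. A BMP character needs one 16-bit code unit, while a character outside the BMP needs a surrogate pair, which UCS-2 has no way to express.

```python
# Python 3 sketch; utf16_code_units is a hypothetical helper for illustration.
bmp_char = "\u00b5"          # MICRO SIGN, inside the BMP
astral_char = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, outside the BMP


def utf16_code_units(ch):
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_units = utf16_code_units(bmp_char)
astral_units = utf16_code_units(astral_char)

print([hex(u) for u in bmp_units])      # one code unit
print([hex(u) for u in astral_units])   # two code units: a surrogate pair
```

The second list holds a high surrogate followed by a low surrogate; neither value is a character on its own.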
Re: What encoding does u'...' syntax use?
>>> My question is: what is that encoding?
>>
>> The internal representation is either UTF-16, or UTF-32; which one is a compile-time choice (i.e. when the Python interpreter is built).
>
> Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the countless threads about the distinction between UTF and UCS?

You are not misremembering. I personally never found them conclusive, and, with PEP 261, I think, calling the 2-byte version "UCS-2" is incorrect.

Regards,
Martin
Re: What encoding does u'...' syntax use?
>> I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a slight difference to UTF-16/UTF-32).
>
> I wouldn't call the difference that slight, especially between UTF-16 and UCS-2, since the former can encode all Unicode code points, while the latter can only encode those in the BMP.

Indeed. As Python *can* encode all characters even in 2-byte mode (since PEP 261), it seems clear that Python's Unicode representation is *not* strictly UCS-2 anymore.

Regards,
Martin
Re: What encoding does u'...' syntax use?
On Sat, Feb 21, 2009 at 9:10 PM, Martin v. Löwis mar...@v.loewis.de wrote:
>>> I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a slight difference to UTF-16/UTF-32).
>>
>> I wouldn't call the difference that slight, especially between UTF-16 and UCS-2, since the former can encode all Unicode code points, while the latter can only encode those in the BMP.
>
> Indeed. As Python *can* encode all characters even in 2-byte mode (since PEP 261), it seems clear that Python's Unicode representation is *not* strictly UCS-2 anymore.

Since we're already discussing this, I'm curious - why was UCS-2 chosen over plain UTF-16 or UTF-8 in the first place for Python's internal storage?
--
Denis Kasak
Re: What encoding does u'...' syntax use?
On Feb 21, 10:48 am, a...@pythoncraft.com (Aahz) wrote:
> In article 499f397c.7030...@v.loewis.de, Martin v. Löwis mar...@v.loewis.de wrote:
>>> Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it parses the ascii string u'\xb5'
>>>
>>> My question is: what is that encoding?
>>
>> The internal representation is either UTF-16, or UTF-32; which one is a compile-time choice (i.e. when the Python interpreter is built).
>
> Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the countless threads about the distinction between UTF and UCS?

Nope, that's partly mislabeling and partly a bug. UCS-2/UCS-4 refer to Unicode 1.1 and earlier, with no surrogates. We target Unicode 5.1.

If you naively encode UCS-2 as UTF-8 you really end up with CESU-8. You miss the step where you combine surrogate pairs (which only exist in UTF-16) into a single supplementary character. Lo and behold, that's actually what current Python does in some places. It's not pretty.

See bugs #3297 and #3672.
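A rough sketch of the CESU-8 effect mentioned above, written for Python 3 (an assumption; the thread is about Python 2). It does not reproduce the behaviour in the cited bug reports. It only shows what comes out when each half of a surrogate pair is UTF-8 encoded separately, rather than combined into one supplementary character first. The "surrogatepass" error handler is used here only to let the lone surrogates through the encoder.

```python
# Python 3 sketch: proper UTF-8 versus the "encode each surrogate" result.
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

# Correct UTF-8: one 4-byte sequence for the whole code point.
utf8 = ch.encode("utf-8")

# Split the character into its two UTF-16 code units (a surrogate pair).
data = ch.encode("utf-16-be")
surrogates = [chr(int.from_bytes(data[i:i + 2], "big")) for i in range(0, len(data), 2)]

# Naive path: UTF-8 encode each surrogate on its own, skipping the combining step.
cesu8 = b"".join(s.encode("utf-8", "surrogatepass") for s in surrogates)

print(utf8.hex(), len(utf8))
print(cesu8.hex(), len(cesu8))
```

The naive result is two 3-byte sequences instead of one 4-byte sequence, so a strict UTF-8 decoder will not accept it.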
Re: What encoding does u'...' syntax use?
>> Indeed. As Python *can* encode all characters even in 2-byte mode (since PEP 261), it seems clear that Python's Unicode representation is *not* strictly UCS-2 anymore.
>
> Since we're already discussing this, I'm curious - why was UCS-2 chosen over plain UTF-16 or UTF-8 in the first place for Python's internal storage?

You mean, originally? Originally, the choice was only between UCS-2 and UCS-4; choice was in favor of UCS-2 because of size concerns. UTF-8 was ruled out easily because it doesn't allow constant-size indexing; UTF-16 essentially for the same reason (plus there was no point to UTF-16, since there were no assigned characters outside the BMP).

Regards,
Martin
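To make the "constant-size indexing" argument concrete, here is a small Python 3 sketch (an assumption; the helper char_at_utf32 is invented for illustration and is not how CPython is implemented). With a fixed-width encoding, the n-th character sits at a computable byte offset; with UTF-8 the width varies per character, so the offset cannot be computed without scanning.

```python
# Python 3 sketch; char_at_utf32 is a hypothetical helper.
text = "a\u00b5\u20ac"  # 'a', MICRO SIGN, EURO SIGN

# Bytes needed per character in each encoding.
utf8_widths = [len(c.encode("utf-8")) for c in text]
utf32_widths = [len(c.encode("utf-32-be")) for c in text]


def char_at_utf32(data, index):
    """Fetch character number `index` from UTF-32 data by plain arithmetic."""
    return data[4 * index:4 * index + 4].decode("utf-32-be")


encoded = text.encode("utf-32-be")
third = char_at_utf32(encoded, 2)

print(utf8_widths)   # widths differ from character to character
print(utf32_widths)  # every character has the same width
print(third == "\u20ac")
```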
Re: What encoding does u'...' syntax use?
On Sat, Feb 21, 2009 at 9:45 PM, Martin v. Löwis mar...@v.loewis.de wrote:
>>> Indeed. As Python *can* encode all characters even in 2-byte mode (since PEP 261), it seems clear that Python's Unicode representation is *not* strictly UCS-2 anymore.
>>
>> Since we're already discussing this, I'm curious - why was UCS-2 chosen over plain UTF-16 or UTF-8 in the first place for Python's internal storage?
>
> You mean, originally? Originally, the choice was only between UCS-2 and UCS-4; choice was in favor of UCS-2 because of size concerns. UTF-8 was ruled out easily because it doesn't allow constant-size indexing; UTF-16 essentially for the same reason (plus there was no point to UTF-16, since there were no assigned characters outside the BMP).

Yes, I failed to realise how long ago the unicode data type was implemented originally. :-)

Thanks for the explanation.
--
Denis Kasak
Re: What encoding does u'...' syntax use?
Ron Garret wrote:
> I would have thought that the answer would be: the default encoding (duh!) But empirically this appears not to be the case:
>
> >>> unicode('\xb5')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)
> >>> u'\xb5'
> u'\xb5'
> >>> print u'\xb5'
> µ
>
> (That last character shows up as a micron sign despite the fact that my default encoding is ascii, so it seems to me that that unicode string must somehow have picked up a latin-1 encoding.)

You are mixing up console output and internal data representation. What you see in the last line is what the Python interpreter makes of your unicode string when passing it into stdout, which in your case seems to use a latin-1 encoding (check your environment settings for that).

BTW, Unicode is not an encoding. Wikipedia will tell you more.

Stefan
Re: What encoding does u'...' syntax use?
Stefan Behnel wrote:
>> >>> print u'\xb5'
>> µ
>
> What you see in the last line is what the Python interpreter makes of your unicode string when passing it into stdout, which in your case seems to use a latin-1 encoding (check your environment settings for that).

The "seems to" is misleading. The example doesn't actually tell you anything about the encoding used by your console, except that it can display non-ASCII characters.

Stefan
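A short Python 3 sketch (an assumption; the thread uses Python 2's print statement) of why a correctly displayed µ says little about the console encoding. The same character becomes different bytes under different output encodings, and any encoding that contains the character can display it. The encodings chosen here are only examples.

```python
# Python 3 sketch: one character, several possible byte sequences on output.
import sys

micro = "\u00b5"  # MICRO SIGN

latin1_bytes = micro.encode("latin-1")  # a single byte
utf8_bytes = micro.encode("utf-8")      # two bytes

# ASCII has no MICRO SIGN, so an ASCII-only output stream could not show it.
try:
    micro.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(latin1_bytes, utf8_bytes, ascii_can_encode)

# The encoding actually used by print() depends on the environment.
print(getattr(sys.stdout, "encoding", None))
```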
Re: What encoding does u'...' syntax use?
In article 499f18bd$0$31879$9b4e6...@newsspool3.arcor-online.net, Stefan Behnel stefan...@behnel.de wrote:
> Ron Garret wrote:
>> I would have thought that the answer would be: the default encoding (duh!) But empirically this appears not to be the case:
>>
>> >>> unicode('\xb5')
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)
>> >>> u'\xb5'
>> u'\xb5'
>> >>> print u'\xb5'
>> µ
>>
>> (That last character shows up as a micron sign despite the fact that my default encoding is ascii, so it seems to me that that unicode string must somehow have picked up a latin-1 encoding.)
>
> You are mixing up console output and internal data representation. What you see in the last line is what the Python interpreter makes of your unicode string when passing it into stdout, which in your case seems to use a latin-1 encoding (check your environment settings for that).
>
> BTW, Unicode is not an encoding. Wikipedia will tell you more.

Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it parses the ascii string u'\xb5'

My question is: what is that encoding? It can't be ascii. So what is it?

Put this another way: I would have thought that when the Python parser parses u'\xb5' it would produce the same result as calling unicode('\xb5'), but it doesn't. Instead it seems to produce the same result as calling unicode('\xb5', 'latin-1'). But my default encoding is not latin-1, it's ascii. So where is the Python parser getting its encoding from? Why does parsing u'\xb5' not produce the same error as calling unicode('\xb5')?

rg
Re: What encoding does u'...' syntax use?
Ron Garret wrote:
> I would have thought that the answer would be: the default encoding (duh!) But empirically this appears not to be the case:
>
> >>> unicode('\xb5')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)

The unicode function is usually used to decode bytes read from *external sources*, each of which can have its own encoding. So the function (actually, developer crew) refuses to guess and uses the ascii common subset.

> >>> u'\xb5'
> u'\xb5'
> >>> print u'\xb5'
> µ

Unicode literals are *in the source file*, which can only have one encoding (for a given source file).

> (That last character shows up as a micron sign despite the fact that my default encoding is ascii, so it seems to me that that unicode string must somehow have picked up a latin-1 encoding.)

I think latin-1 was the default without a coding cookie line. (May be utf-8 in 3.0).
Re: What encoding does u'...' syntax use?
Ron Garret rnospa...@flownet.com writes:
> Put this another way: I would have thought that when the Python parser parses u'\xb5' it would produce the same result as calling unicode('\xb5'), but it doesn't. Instead it seems to produce the same result as calling unicode('\xb5', 'latin-1'). But my default encoding is not latin-1, it's ascii. So where is the Python parser getting its encoding from? Why does parsing u'\xb5' not produce the same error as calling unicode('\xb5')?

There is no encoding involved other than ascii, only processing of a backslash escape. The backslash escape '\xb5' is converted to the unicode character whose ordinal number is B5h. This gives the same result as '\xb5'.decode('latin-1') because the unicode numbering is the same as the 'latin-1' numbering in that range.

-M-
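The numbering coincidence described above can be checked directly. This is a Python 3 sketch (an assumption; in Python 3 the byte string is spelled b'\xb5' and the Unicode string needs no u prefix), and the variable names are only illustrative.

```python
# Python 3 sketch: latin-1 byte values equal Unicode code points for 0..255.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

# Every byte value b decodes to the character whose ordinal is also b.
matches = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))

# The escape in the literal names code point 0xB5 directly; no decoding step.
escape_ordinal = ord("\xb5")
same_as_latin1 = "\xb5" == b"\xb5".decode("latin-1")

print(matches, hex(escape_ordinal), same_as_latin1)
```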
Re: What encoding does u'...' syntax use?
> Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it parses the ascii string u'\xb5'
>
> My question is: what is that encoding?

The internal representation is either UTF-16, or UTF-32; which one is a compile-time choice (i.e. when the Python interpreter is built).

> Put this another way: I would have thought that when the Python parser parses u'\xb5' it would produce the same result as calling unicode('\xb5'), but it doesn't.

Right. In the former case, \xb5 denotes a Unicode character, namely U+00B5, MICRO SIGN. It is the same as u"\u00b5", and still the same as u"\N{MICRO SIGN}". By "the same", I mean "the very same".

OTOH, unicode('\xb5') is something entirely different. '\xb5' is a byte string with length 1, with a single byte with the numeric value 0xb5, or 181. It does not, per se, denote any specific character. It only gets a character meaning when you try to decode it to unicode, which you do with unicode('\xb5'). This is short for

unicode('\xb5', sys.getdefaultencoding())

and sys.getdefaultencoding() is (or should be) "ascii". Now, in ASCII, byte 0xb5 does not have a meaning (i.e. it does not denote a character at all), hence you get a UnicodeError.

> Instead it seems to produce the same result as calling unicode('\xb5', 'latin-1').

Sure. However, this is only by coincidence, because latin-1 has the same code points as Unicode (for 0..255).

> But my default encoding is not latin-1, it's ascii. So where is the Python parser getting its encoding from? Why does parsing u'\xb5' not produce the same error as calling unicode('\xb5')?

Because \xb5 *directly* refers to character U+00b5, with no byte-oriented encoding in-between.

Regards,
Martin
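A compact Python 3 rendering of the two cases contrasted above (an assumption: Python 3 has no unicode() builtin, so bytes.decode("ascii") stands in for unicode('\xb5') with an ASCII default encoding). The three literal spellings name the same character, while the byte string has no character meaning until it is decoded.

```python
# Python 3 sketch: escapes in a text literal versus decoding a byte string.

# Three spellings of the same one-character string, U+00B5 MICRO SIGN.
same = "\xb5" == "\u00b5" == "\N{MICRO SIGN}"

# A byte string of length 1 holding the value 0xB5 (181).
raw = b"\xb5"

# Decoding it as ASCII fails, because ASCII assigns no character to 0xB5.
try:
    raw.decode("ascii")
    ascii_failed = False
    message = ""
except UnicodeDecodeError as exc:
    ascii_failed = True
    message = str(exc)

print(same, len(raw), raw[0], ascii_failed)
print(message)
```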
Re: What encoding does u'...' syntax use?
>> >>> u'\xb5'
>> u'\xb5'
>> >>> print u'\xb5'
>> µ
>
> Unicode literals are *in the source file*, which can only have one encoding (for a given source file).
>
>> (That last character shows up as a micron sign despite the fact that my default encoding is ascii, so it seems to me that that unicode string must somehow have picked up a latin-1 encoding.)
>
> I think latin-1 was the default without a coding cookie line. (May be utf-8 in 3.0).

It is, but that's irrelevant for the example. In the source

u'\xb5'

all characters are ASCII (i.e. all of "letter u", "single quote", "backslash", "letter x", "letter b", "digit 5"). As a consequence, this source text has the same meaning in all supported source encodings (as source encodings must be ASCII supersets).

The Unicode literal shown here does not get its interpretation from Latin-1. Instead, it directly gets its interpretation from the Unicode coded character set. The string is a short-hand for

u'\u00b5'

and this denotes character U+00B5 (just as u'\u20ac' denotes U+20AC; the same holds for any other u'\uXXXX').

HTH,
Martin
Re: What encoding does u'...' syntax use?
In article 499f3a8f.9010...@v.loewis.de, Martin v. Löwis mar...@v.loewis.de wrote:
>>> >>> u'\xb5'
>>> u'\xb5'
>>> >>> print u'\xb5'
>>> µ
>>
>> Unicode literals are *in the source file*, which can only have one encoding (for a given source file).
>>
>>> (That last character shows up as a micron sign despite the fact that my default encoding is ascii, so it seems to me that that unicode string must somehow have picked up a latin-1 encoding.)
>>
>> I think latin-1 was the default without a coding cookie line. (May be utf-8 in 3.0).
>
> It is, but that's irrelevant for the example. In the source u'\xb5' all characters are ASCII (i.e. all of "letter u", "single quote", "backslash", "letter x", "letter b", "digit 5"). As a consequence, this source text has the same meaning in all supported source encodings (as source encodings must be ASCII supersets).
>
> The Unicode literal shown here does not get its interpretation from Latin-1. Instead, it directly gets its interpretation from the Unicode coded character set. The string is a short-hand for u'\u00b5' and this denotes character U+00B5 (just as u'\u20ac' denotes U+20AC; the same holds for any other u'\uXXXX').
>
> HTH,
> Martin

Ah, that makes sense. Thanks!

rg
Re: What encoding does u'...' syntax use?
In article 499f397c.7030...@v.loewis.de, Martin v. Löwis mar...@v.loewis.de wrote:
>> Yes, I know that. But every concrete representation of a unicode string has to have an encoding associated with it, including unicode strings produced by the Python parser when it parses the ascii string u'\xb5'
>>
>> My question is: what is that encoding?
>
> The internal representation is either UTF-16, or UTF-32; which one is a compile-time choice (i.e. when the Python interpreter is built).
>
>> Put this another way: I would have thought that when the Python parser parses u'\xb5' it would produce the same result as calling unicode('\xb5'), but it doesn't.
>
> Right. In the former case, \xb5 denotes a Unicode character, namely U+00B5, MICRO SIGN. It is the same as u"\u00b5", and still the same as u"\N{MICRO SIGN}". By "the same", I mean "the very same".
>
> OTOH, unicode('\xb5') is something entirely different. '\xb5' is a byte string with length 1, with a single byte with the numeric value 0xb5, or 181. It does not, per se, denote any specific character. It only gets a character meaning when you try to decode it to unicode, which you do with unicode('\xb5'). This is short for unicode('\xb5', sys.getdefaultencoding()) and sys.getdefaultencoding() is (or should be) "ascii". Now, in ASCII, byte 0xb5 does not have a meaning (i.e. it does not denote a character at all), hence you get a UnicodeError.
>
>> Instead it seems to produce the same result as calling unicode('\xb5', 'latin-1').
>
> Sure. However, this is only by coincidence, because latin-1 has the same code points as Unicode (for 0..255).
>
>> But my default encoding is not latin-1, it's ascii. So where is the Python parser getting its encoding from? Why does parsing u'\xb5' not produce the same error as calling unicode('\xb5')?
>
> Because \xb5 *directly* refers to character U+00b5, with no byte-oriented encoding in-between.
>
> Regards,
> Martin

OK, I think I get it now. Thanks!

rg
Re: What encoding does u'...' syntax use?
Martin v. Löwis wrote:
>>> [...] somehow have picked up a latin-1 encoding.)
>>
>> I think latin-1 was the default without a coding cookie line. (May be utf-8 in 3.0).
>
> It is, but that's irrelevant for the example. In the source u'\xb5' all characters are ASCII (i.e. all of "letter u", "single quote", "backslash", "letter x", "letter b", "digit 5"). As a consequence, this source text has the same meaning in all supported source encodings (as source encodings must be ASCII supersets).

I think I understand now that the coding cookie only matters if I use an editor that actually stores *non-ascii* bytes in the file for the Python parser to interpret.
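That conclusion can be tried out with a small Python 3 sketch (an assumption; the thread is about Python 2, and the helper run_source plus the choice of ISO-8859-5 as a contrasting encoding are invented for this illustration). Source text is passed to compile() as bytes so that the coding cookie on the first line decides how those bytes are decoded.

```python
# Python 3 sketch; run_source is a hypothetical helper.
def run_source(source_bytes):
    """Compile byte-string source (honouring its coding cookie) and return s."""
    namespace = {}
    exec(compile(source_bytes, "<example>", "exec"), namespace)
    return namespace["s"]


# Pure-ASCII source: the escape \xb5 is spelled out, so the cookie is irrelevant.
escaped_latin1 = run_source(b"# -*- coding: latin-1 -*-\ns = u'\\xb5'\n")
escaped_cyrillic = run_source(b"# -*- coding: iso-8859-5 -*-\ns = u'\\xb5'\n")

# Source containing the raw byte 0xB5: now the cookie decides what it means.
raw_latin1 = run_source(b"# -*- coding: latin-1 -*-\ns = u'\xb5'\n")
raw_cyrillic = run_source(b"# -*- coding: iso-8859-5 -*-\ns = u'\xb5'\n")

print(escaped_latin1 == escaped_cyrillic)  # same character either way
print(hex(ord(raw_latin1)), hex(ord(raw_cyrillic)))  # differ with the cookie
```

With the escape, both cookies give U+00B5. With the raw byte, Latin-1 still gives U+00B5, while ISO-8859-5 gives a Cyrillic letter.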