Re: What encoding does u'...' syntax use?

2009-02-21 Thread Aahz
In article 499f397c.7030...@v.loewis.de,
"Martin v. Löwis" mar...@v.loewis.de wrote:
 Yes, I know that.  But every concrete representation of a unicode string 
 has to have an encoding associated with it, including unicode strings 
 produced by the Python parser when it parses the ascii string u'\xb5'
 
 My question is: what is that encoding?

The internal representation is either UTF-16, or UTF-32; which one is
a compile-time choice (i.e. when the Python interpreter is built).

Wait, I thought it was UCS-2 or UCS-4?  Or am I misremembering the
countless threads about the distinction between UTF and UCS?
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

Weinberg's Second Law: If builders built buildings the way programmers wrote 
programs, then the first woodpecker that came along would destroy civilization.
--
http://mail.python.org/mailman/listinfo/python-list


Re: What encoding does u'...' syntax use?

2009-02-21 Thread Thorsten Kampe
* Martin v. Löwis (Sat, 21 Feb 2009 00:15:08 +0100)
  Yes, I know that. But every concrete representation of a unicode
  string has to have an encoding associated with it, including unicode
  strings produced by the Python parser when it parses the ascii
  string u'\xb5'
  
  My question is: what is that encoding?
 
 The internal representation is either UTF-16, or UTF-32; which one is
 a compile-time choice (i.e. when the Python interpreter is built).

I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a 
slight difference to UTF-16/UTF-32).

Thorsten


Re: What encoding does u'...' syntax use?

2009-02-21 Thread Denis Kasak
On Sat, Feb 21, 2009 at 7:24 PM, Thorsten Kampe
thors...@thorstenkampe.de wrote:

 I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a
 slight difference to UTF-16/UTF-32).

I wouldn't call the difference that slight, especially between UTF-16
and UCS-2, since the former can encode all Unicode code points, while
the latter can only encode those in the BMP.
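
A minimal sketch of that difference, assuming Python 3 syntax (the thread itself is about Python 2); the helper name utf16_code_units is made up for illustration. It lists the 16-bit code units that UTF-16 produces for one character inside the BMP and one outside it:

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text as encoded in UTF-16 (little-endian)."""
    raw = text.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(raw) // 2), raw))

# EURO SIGN (U+20AC) lies inside the BMP: one code unit, same as UCS-2.
bmp_units = utf16_code_units('\u20ac')

# MUSICAL SYMBOL G CLEF (U+1D11E) lies outside the BMP: UTF-16 needs a
# surrogate pair, which plain UCS-2 has no way to express.
astral_units = utf16_code_units('\U0001d11e')

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
```

The second result has two code units for a single character, which is exactly the case a strict UCS-2 reading cannot represent.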

--
Denis Kasak


Re: What encoding does u'...' syntax use?

2009-02-21 Thread Martin v. Löwis
 My question is: what is that encoding?
 The internal representation is either UTF-16, or UTF-32; which one is
 a compile-time choice (i.e. when the Python interpreter is built).
 
 Wait, I thought it was UCS-2 or UCS-4?  Or am I misremembering the
 countless threads about the distinction between UTF and UCS?

You are not misremembering. I personally never found them conclusive,
and, with PEP 261, I think, calling the 2-byte version UCS-2 is
incorrect.

Regards,
Martin


Re: What encoding does u'...' syntax use?

2009-02-21 Thread Martin v. Löwis
 I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a
 slight difference to UTF-16/UTF-32).
 
 I wouldn't call the difference that slight, especially between UTF-16
 and UCS-2, since the former can encode all Unicode code points, while
 the latter can only encode those in the BMP.

Indeed. As Python *can* encode all characters even in 2-byte mode
(since PEP 261), it seems clear that Python's Unicode representation
is *not* strictly UCS-2 anymore.

Regards,
Martin


Re: What encoding does u'...' syntax use?

2009-02-21 Thread Denis Kasak
On Sat, Feb 21, 2009 at 9:10 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 I'm pretty much sure it is UCS-2 or UCS-4. (Yes, I know there is only a
 slight difference to UTF-16/UTF-32).

 I wouldn't call the difference that slight, especially between UTF-16
 and UCS-2, since the former can encode all Unicode code points, while
 the latter can only encode those in the BMP.

 Indeed. As Python *can* encode all characters even in 2-byte mode
 (since PEP 261), it seems clear that Python's Unicode representation
 is *not* strictly UCS-2 anymore.

Since we're already discussing this, I'm curious - why was UCS-2
chosen over plain UTF-16 or UTF-8 in the first place for Python's
internal storage?

-- 
Denis Kasak


Re: What encoding does u'...' syntax use?

2009-02-21 Thread Adam Olsen
On Feb 21, 10:48 am, a...@pythoncraft.com (Aahz) wrote:
 In article 499f397c.7030...@v.loewis.de,

 "Martin v. Löwis" mar...@v.loewis.de wrote:
  Yes, I know that.  But every concrete representation of a unicode string
  has to have an encoding associated with it, including unicode strings
  produced by the Python parser when it parses the ascii string u'\xb5'

  My question is: what is that encoding?

 The internal representation is either UTF-16, or UTF-32; which one is
 a compile-time choice (i.e. when the Python interpreter is built).

 Wait, I thought it was UCS-2 or UCS-4?  Or am I misremembering the
 countless threads about the distinction between UTF and UCS?

Nope, that's partly mislabeling and partly a bug.  UCS-2/UCS-4 refer
to Unicode 1.1 and earlier, with no surrogates.  We target Unicode
5.1.

If you naively encode UCS-2 as UTF-8 you really end up with CESU-8.
You miss the step where you combine surrogate pairs (which only exist
in UTF-16) into a single supplementary character.  Lo and behold,
that's actually what current python does in some places.  It's not
pretty.
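
To illustrate the CESU-8 point, here is a rough sketch in Python 3 (an assumption; it is not the code from the bug reports). The function name to_cesu8 is hypothetical. It splits each supplementary character into its UTF-16 surrogates and encodes each surrogate as if it were an ordinary character, which is the "naive" path described above:

```python
def to_cesu8(text):
    """Encode text the 'naive' way: surrogates are encoded individually."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split into a UTF-16 surrogate pair first ...
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            # ... then encode each surrogate as its own 3-byte sequence.
            for surrogate in (high, low):
                out += chr(surrogate).encode('utf-8', 'surrogatepass')
        else:
            out += ch.encode('utf-8')
    return bytes(out)

smiley = '\U0001f600'                 # a character outside the BMP
proper_utf8 = smiley.encode('utf-8')  # one 4-byte sequence
naive_cesu8 = to_cesu8(smiley)        # two 3-byte sequences

print(proper_utf8, naive_cesu8)
```

For text restricted to the BMP the two encodings agree; they diverge only for supplementary characters.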

See bugs #3297 and #3672.


Re: What encoding does u'...' syntax use?

2009-02-21 Thread Martin v. Löwis
 Indeed. As Python *can* encode all characters even in 2-byte mode
 (since PEP 261), it seems clear that Python's Unicode representation
 is *not* strictly UCS-2 anymore.
 
 Since we're already discussing this, I'm curious - why was UCS-2
 chosen over plain UTF-16 or UTF-8 in the first place for Python's
 internal storage?

You mean, originally? Originally, the choice was only between UCS-2
and UCS-4; choice was in favor of UCS-2 because of size concerns.
UTF-8 was ruled out easily because it doesn't allow constant-size
indexing; UTF-16 essentially for the same reason (plus there was
no point to UTF-16, since there were no assigned characters outside
the BMP).
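
A small sketch of the indexing argument, assuming Python 3; char_at is an illustrative helper, not anything from the interpreter. With a fixed-width encoding the offset of character number i is a simple multiplication, while in UTF-8 each character can have a different width:

```python
text = 'a\xb5\u20ac'   # LATIN SMALL LETTER A, MICRO SIGN, EURO SIGN

# UTF-8: every character may take a different number of bytes, so finding
# character number i means scanning from the start.
utf8_widths = [len(ch.encode('utf-8')) for ch in text]

# UTF-32: every character takes exactly 4 bytes, so indexing is arithmetic.
utf32 = text.encode('utf-32-le')

def char_at(buf, index):
    """Fetch character number index from a UTF-32-LE buffer by offset."""
    return buf[4 * index:4 * index + 4].decode('utf-32-le')

print(utf8_widths)
print(char_at(utf32, 2))
```

The same arithmetic works for 2-byte storage as long as no surrogate pairs are present, which was the situation when there were no assigned characters outside the BMP.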

Regards,
Martin





Re: What encoding does u'...' syntax use?

2009-02-21 Thread Denis Kasak
On Sat, Feb 21, 2009 at 9:45 PM, Martin v. Löwis mar...@v.loewis.de wrote:
 Indeed. As Python *can* encode all characters even in 2-byte mode
 (since PEP 261), it seems clear that Python's Unicode representation
 is *not* strictly UCS-2 anymore.

 Since we're already discussing this, I'm curious - why was UCS-2
 chosen over plain UTF-16 or UTF-8 in the first place for Python's
 internal storage?

 You mean, originally? Originally, the choice was only between UCS-2
 and UCS-4; choice was in favor of UCS-2 because of size concerns.
 UTF-8 was ruled out easily because it doesn't allow constant-size
 indexing; UTF-16 essentially for the same reason (plus there was
 no point to UTF-16, since there were no assigned characters outside
 the BMP).

Yes, I failed to realise how long ago the unicode data type was
implemented originally. :-)
Thanks for the explanation.

-- 
Denis Kasak


Re: What encoding does u'...' syntax use?

2009-02-20 Thread Stefan Behnel
Ron Garret wrote:
 I would have thought that the answer would be: the default encoding 
 (duh!)  But empirically this appears not to be the case:
 
 >>> unicode('\xb5')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: 
 ordinal not in range(128)
 >>> u'\xb5'
 u'\xb5'
 >>> print u'\xb5'
 µ
 
 (That last character shows up as a micron sign despite the fact that my 
 default encoding is ascii, so it seems to me that that unicode string 
 must somehow have picked up a latin-1 encoding.)

You are mixing up console output and internal data representation. What you
see in the last line is what the Python interpreter makes of your unicode
string when passing it into stdout, which in your case seems to use a
latin-1 encoding (check your environment settings for that).
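
The separation between the string and its output form can be sketched like this (Python 3 assumed; bytes_written is a hypothetical helper that stands in for stdout with a given encoding):

```python
import io

def bytes_written(text, encoding):
    """Write text through a text stream and return the bytes it produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

micro = '\xb5'  # one character, U+00B5, regardless of any output stream

as_latin1 = bytes_written(micro, 'latin-1')
as_utf8 = bytes_written(micro, 'utf-8')

print(as_latin1, as_utf8)
```

The string is the same in both calls; only the bytes sent to the stream differ, and they depend on the stream's encoding rather than on how the string was created.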

BTW, Unicode is not an encoding. Wikipedia will tell you more.

Stefan


Re: What encoding does u'...' syntax use?

2009-02-20 Thread Stefan Behnel
Stefan Behnel wrote:
 >>> print u'\xb5'
 µ
 
 What you
 see in the last line is what the Python interpreter makes of your unicode
 string when passing it into stdout, which in your case seems to use a
 latin-1 encoding (check your environment settings for that).

The "seems to" is misleading. The example doesn't actually tell you
anything about the encoding used by your console, except that it can
display non-ASCII characters.

Stefan


Re: What encoding does u'...' syntax use?

2009-02-20 Thread Ron Garret
In article 499f18bd$0$31879$9b4e6...@newsspool3.arcor-online.net,
 Stefan Behnel stefan...@behnel.de wrote:

 Ron Garret wrote:
  I would have thought that the answer would be: the default encoding 
  (duh!)  But empirically this appears not to be the case:
  
  >>> unicode('\xb5')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: 
  ordinal not in range(128)
  >>> u'\xb5'
  u'\xb5'
  >>> print u'\xb5'
  µ
  
  (That last character shows up as a micron sign despite the fact that my 
  default encoding is ascii, so it seems to me that that unicode string 
  must somehow have picked up a latin-1 encoding.)
 
 You are mixing up console output and internal data representation. What you
 see in the last line is what the Python interpreter makes of your unicode
 string when passing it into stdout, which in your case seems to use a
 latin-1 encoding (check your environment settings for that).
 
 BTW, Unicode is not an encoding. Wikipedia will tell you more.

Yes, I know that.  But every concrete representation of a unicode string 
has to have an encoding associated with it, including unicode strings 
produced by the Python parser when it parses the ascii string u'\xb5'

My question is: what is that encoding?  It can't be ascii.  So what is 
it?

Put this another way: I would have thought that when the Python parser 
parses u'\xb5' it would produce the same result as calling 
unicode('\xb5'), but it doesn't.  Instead it seems to produce the same 
result as calling unicode('\xb5', 'latin-1').  But my default encoding 
is not latin-1, it's ascii.  So where is the Python parser getting its 
encoding from?  Why does parsing u'\xb5' not produce the same error as 
calling unicode('\xb5')?

rg


Re: What encoding does u'...' syntax use?

2009-02-20 Thread Terry Reedy

Ron Garret wrote:
 I would have thought that the answer would be: the default encoding 
 (duh!)  But empirically this appears not to be the case:
 
 >>> unicode('\xb5')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: 
 ordinal not in range(128)

The unicode function is usually used to decode bytes read from *external 
sources*, each of which can have its own encoding.  So the function 
(actually, developer crew) refuses to guess and uses the ascii common 
subset.

 >>> u'\xb5'
 u'\xb5'
 >>> print u'\xb5'
 µ

Unicode literals are *in the source file*, which can only have one 
encoding (for a given source file).

 (That last character shows up as a micron sign despite the fact that my 
 default encoding is ascii, so it seems to me that that unicode string 
 must somehow have picked up a latin-1 encoding.)

I think latin-1 was the default without a coding cookie line.  (May be 
utf-8 in 3.0).




Re: What encoding does u'...' syntax use?

2009-02-20 Thread Matthew Woodcraft
Ron Garret rnospa...@flownet.com writes:
 Put this another way: I would have thought that when the Python parser
 parses u'\xb5' it would produce the same result as calling
 unicode('\xb5'), but it doesn't. Instead it seems to produce the same
 result as calling unicode('\xb5', 'latin-1'). But my default encoding
 is not latin-1, it's ascii. So where is the Python parser getting its
 encoding from? Why does parsing u'\xb5' not produce the same error
 as calling unicode('\xb5')?

There is no encoding involved other than ascii, only processing of a
backslash escape.

The backslash escape '\xb5' is converted to the unicode character whose
ordinal number is B5h. This gives the same result as
'\xb5'.decode('latin-1') because the unicode numbering is the same as
the 'latin-1' numbering in that range.

-M-


Re: What encoding does u'...' syntax use?

2009-02-20 Thread Martin v. Löwis
 Yes, I know that.  But every concrete representation of a unicode string 
 has to have an encoding associated with it, including unicode strings 
 produced by the Python parser when it parses the ascii string u'\xb5'
 
 My question is: what is that encoding?

The internal representation is either UTF-16, or UTF-32; which one is
a compile-time choice (i.e. when the Python interpreter is built).

 Put this another way: I would have thought that when the Python parser 
 parses u'\xb5' it would produce the same result as calling 
 unicode('\xb5'), but it doesn't.

Right. In the former case, \xb5 denotes a Unicode character, namely
U+00B5, MICRO SIGN. It is the same as u"\u00b5", and still the same
as u"\N{MICRO SIGN}". By "the same", I mean "the very same".
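
As a quick check of that equivalence, here is a sketch in Python 3 (an assumption; there the u prefix is optional and every string literal is a Unicode string):

```python
import unicodedata

by_hex = '\xb5'              # two-digit hex escape
by_u = '\u00b5'              # four-digit Unicode escape
by_name = '\N{MICRO SIGN}'   # escape by character name

# All three spellings produce the same one-character string.
print(by_hex == by_u == by_name)
print(len(by_hex), hex(ord(by_hex)), unicodedata.name(by_hex))
```

None of the three escapes involves decoding bytes; each one names a code point directly.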

OTOH, unicode('\xb5') is something entirely different. '\xb5' is a
byte string with length 1, with a single byte with the numeric
value 0xb5, or 181. It does not, per se, denote any specific character.
It only gets a character meaning when you try to decode it to unicode,
which you do with unicode('\xb5'). This is short for

  unicode('\xb5', sys.getdefaultencoding())

and sys.getdefaultencoding() is (or should be) "ascii". Now, in
ASCII, byte 0xb5 does not have a meaning (i.e. it does not denote
a character at all), hence you get a UnicodeError.
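
The failing decode can be reproduced in Python 3 terms as a rough sketch; bytes.decode('ascii') plays the role that unicode('\xb5') with an ascii default encoding plays in the thread, which is an assumption of this example:

```python
data = b'\xb5'   # a byte string of length 1 with the single byte 0xb5

try:
    data.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    # ASCII only defines bytes 0..127, so 0xb5 has no meaning there.
    error = exc

print(error)
```

The exception object records which codec rejected the input and at which position, matching the traceback quoted earlier in the thread.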

 Instead it seems to produce the same 
 result as calling unicode('\xb5', 'latin-1').

Sure. However, this is only by coincidence, because latin-1 has the same
code points as Unicode (for 0..255).
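
A short sketch of that coincidence, assuming Python 3: decoding every possible byte value as latin-1 yields exactly the code points 0..255, which is not true of other single-byte encodings such as cp1252.

```python
all_bytes = bytes(range(256))

# latin-1 maps byte value n to code point n for every n in 0..255.
latin1_code_points = [ord(ch) for ch in all_bytes.decode('latin-1')]
identity = latin1_code_points == list(range(256))

# Counter-example: in cp1252, byte 0x80 is EURO SIGN (U+20AC), not U+0080.
cp1252_0x80 = b'\x80'.decode('cp1252')

print(identity, hex(ord(cp1252_0x80)))
```

So a latin-1 decode and a \xNN escape agree on the result, even though only the former is an actual decoding step.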

 But my default encoding 
 is not latin-1, it's ascii.  So where is the Python parser getting its 
 encoding from?  Why does parsing u'\xb5' not produce the same error as 
 calling unicode('\xb5')?

Because \xb5 *directly* refers to character U+00b5, with no
byte-oriented encoding in-between.

Regards,
Martin


Re: What encoding does u'...' syntax use?

2009-02-20 Thread Martin v. Löwis

 >>> u'\xb5'
 u'\xb5'
 >>> print u'\xb5'
 µ
 
 Unicode literals are *in the source file*, which can only have one
 encoding (for a given source file).
 
 (That last character shows up as a micron sign despite the fact that
 my default encoding is ascii, so it seems to me that that unicode
 string must somehow have picked up a latin-1 encoding.)
 
 I think latin-1 was the default without a coding cookie line.  (May be
 utf-8 in 3.0).

It is, but that's irrelevant for the example. In the source

  u'\xb5'

all characters are ASCII (i.e. all of letter u, single
quote, backslash, letter x, letter b, digit 5).
As a consequence, this source text has the same meaning in all
supported source encodings (as source encodings must be ASCII
supersets).
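
This can be checked with a small sketch (Python 3.3 or later assumed, where the u prefix is accepted again). The seven source bytes are decoded under several ASCII-superset encodings and the resulting literal is evaluated each time:

```python
import ast

# The source text u'\xb5' as raw bytes: u, quote, backslash, x, b, 5, quote.
source_bytes = b"u'\\xb5'"

values = {}
for encoding in ('ascii', 'latin-1', 'utf-8', 'cp1252'):
    source_text = source_bytes.decode(encoding)
    values[encoding] = ast.literal_eval(source_text)

print(values)
```

Every entry is the same one-character string, because the bytes in the file are pure ASCII and the escape is resolved only after the source has been decoded.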

The Unicode literal shown here does not get its interpretation
from Latin-1. Instead, it directly gets its interpretation from
the Unicode coded character set. The string is a short-hand
for

 u'\u00b5'

and this denotes character U+00B5 (just as u'\u20ac' denotes
U+20AC; the same holds for any other u'\uXXXX').

HTH,
Martin


Re: What encoding does u'...' syntax use?

2009-02-20 Thread Ron Garret
In article 499f3a8f.9010...@v.loewis.de,
 Martin v. Löwis mar...@v.loewis.de wrote:

  >>> u'\xb5'
  u'\xb5'
  >>> print u'\xb5'
  µ
  
  Unicode literals are *in the source file*, which can only have one
  encoding (for a given source file).
  
  (That last character shows up as a micron sign despite the fact that
  my default encoding is ascii, so it seems to me that that unicode
  string must somehow have picked up a latin-1 encoding.)
  
  I think latin-1 was the default without a coding cookie line.  (May be
  utf-8 in 3.0).
 
 It is, but that's irrelevant for the example. In the source
 
   u'\xb5'
 
 all characters are ASCII (i.e. all of letter u, single
 quote, backslash, letter x, letter b, digit 5).
 As a consequence, this source text has the same meaning in all
 supported source encodings (as source encodings must be ASCII
 supersets).
 
 The Unicode literal shown here does not get its interpretation
 from Latin-1. Instead, it directly gets its interpretation from
 the Unicode coded character set. The string is a short-hand
 for
 
  u'\u00b5'
 
 and this denotes character U+00B5 (just as u'\u20ac' denotes
 U+20AC; the same holds for any other u'\uXXXX').
 
 HTH,
 Martin

Ah, that makes sense.  Thanks!

rg


Re: What encoding does u'...' syntax use?

2009-02-20 Thread Ron Garret
In article 499f397c.7030...@v.loewis.de,
 Martin v. Löwis mar...@v.loewis.de wrote:

  Yes, I know that.  But every concrete representation of a unicode string 
  has to have an encoding associated with it, including unicode strings 
  produced by the Python parser when it parses the ascii string u'\xb5'
  
  My question is: what is that encoding?
 
 The internal representation is either UTF-16, or UTF-32; which one is
 a compile-time choice (i.e. when the Python interpreter is built).
 
  Put this another way: I would have thought that when the Python parser 
  parses u'\xb5' it would produce the same result as calling 
  unicode('\xb5'), but it doesn't.
 
 Right. In the former case, \xb5 denotes a Unicode character, namely
 U+00B5, MICRO SIGN. It is the same as u"\u00b5", and still the same
 as u"\N{MICRO SIGN}". By "the same", I mean "the very same".
 
 OTOH, unicode('\xb5') is something entirely different. '\xb5' is a
 byte string with length 1, with a single byte with the numeric
 value 0xb5, or 181. It does not, per se, denote any specific character.
 It only gets a character meaning when you try to decode it to unicode,
 which you do with unicode('\xb5'). This is short for
 
   unicode('\xb5', sys.getdefaultencoding())
 
 and sys.getdefaultencoding() is (or should be) "ascii". Now, in
 ASCII, byte 0xb5 does not have a meaning (i.e. it does not denote
 a character at all), hence you get a UnicodeError.
 
  Instead it seems to produce the same 
  result as calling unicode('\xb5', 'latin-1').
 
 Sure. However, this is only by coincidence, because latin-1 has the same
 code points as Unicode (for 0..255).
 
  But my default encoding 
  is not latin-1, it's ascii.  So where is the Python parser getting its 
  encoding from?  Why does parsing u'\xb5' not produce the same error as 
  calling unicode('\xb5')?
 
 Because \xb5 *directly* refers to character U+00b5, with no
 byte-oriented encoding in-between.
 
 Regards,
 Martin

OK, I think I get it now.  Thanks!

rg


Re: What encoding does u'...' syntax use?

2009-02-20 Thread Terry Reedy

Martin v. Löwis wrote:
   [...] somehow have picked up a latin-1 encoding.)
  
  I think latin-1 was the default without a coding cookie line.  (May be
  utf-8 in 3.0).
 
 It is, but that's irrelevant for the example. In the source
 
   u'\xb5'
 
 all characters are ASCII (i.e. all of letter u, single
 quote, backslash, letter x, letter b, digit 5).
 As a consequence, this source text has the same meaning in all
 supported source encodings (as source encodings must be ASCII
 supersets).


I think I understand now that the coding cookie only matters if I use an 
editor that actually stores *non-ascii* bytes in the file for the Python 
parser to interpret.
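
A rough sketch of that conclusion, assuming Python 3, where compile() accepts source as bytes and honours the coding cookie; literal_from_source is a hypothetical helper written for this example:

```python
def literal_from_source(literal_bytes, cookie):
    """Compile a tiny module whose string literal holds literal_bytes verbatim.

    cookie is the encoding named in the coding line at the top of the source.
    """
    source = (b"# -*- coding: " + cookie.encode('ascii') + b" -*-\n"
              b"s = '" + literal_bytes + b"'\n")
    namespace = {}
    exec(compile(source, '<demo>', 'exec'), namespace)
    return namespace['s']

# ASCII-only escape sequence in the file: the cookie makes no difference.
escaped_latin1 = literal_from_source(b'\\xb5', 'latin-1')
escaped_utf8 = literal_from_source(b'\\xb5', 'utf-8')

# Raw non-ASCII bytes in the file: the cookie decides what they mean.
raw_as_utf8 = literal_from_source(b'\xc2\xb5', 'utf-8')
raw_as_latin1 = literal_from_source(b'\xc2\xb5', 'latin-1')

print(escaped_latin1 == escaped_utf8, len(raw_as_utf8), len(raw_as_latin1))
```

The two raw bytes are read as one character under utf-8 and as two characters under latin-1, while the escaped form is unaffected by the cookie.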

