On 05.09.2014 at 20:25, Chris “Kwpolska” Warrick <kwpol...@gmail.com> wrote:
> On Sep 5, 2014 7:57 PM, "Kurt Mueller" <kurt.alfred.muel...@gmail.com> wrote:
> > Could someone please explain the following behavior to me:
> > Python 2.7.7, MacOS 10.9 Mavericks
> >
> > >>> import sys
> > >>> sys.getdefaultencoding()
> > 'ascii'
> > >>> [ord(c) for c in 'AÄ']
> > [65, 195, 132]
> > >>> [ord(c) for c in u'AÄ']
> > [65, 196]
> >
> > My obviously wrong understanding:
> > 'AÄ' in 'ascii' are two characters
> >      one with ord A=65 and
> >      one with ord Ä=196 ISO8859-1 <depends on code table>
> >      --> why [65, 195, 132]
> > u'AÄ' is a Unicode string
> >      --> why [65, 196]
> >
> > It is just the other way round as I would expect.
> 
> Basically, the first string is just a bunch of bytes, as provided by your 
> terminal — which sounds like UTF-8 (perfectly logical in 2014).  The second 
> one is converted into a real Unicode representation. The codepoint for Ä is 
> U+00C4 (196 decimal). It's just a coincidence that it also matches latin1 aka 
> ISO 8859-1 as Unicode starts with all 256 latin1 codepoints. Please kindly 
> forget encodings other than UTF-8.
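
To make that concrete, a minimal Python 2.7 sketch (assuming a UTF-8
terminal; the escape sequences stand in for the bytes such a terminal
actually sends for 'AÄ'):

  # -*- coding: utf-8 -*-
  s = 'A\xc3\x84'        # byte string: the raw UTF-8 bytes for 'AÄ'
  u = s.decode('utf-8')  # unicode string: u'A\xc4' (two code points)

  print [ord(c) for c in s]  # [65, 195, 132] -- iterates over bytes
  print [ord(c) for c in u]  # [65, 196]      -- iterates over code points
  print u == u'A\xc4'        # True: Ä is U+00C4 = 196 decimal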

So:
'AÄ' is a UTF-8 byte string represented by 3 bytes:
A -> 41    -> 65  (first byte, decimal)
Ä -> c3 84 -> 195 and 132 (second and third byte, decimal)

u'AÄ' is a Unicode string of 2 characters (code points):
A -> U+0041 -> 65  (code point, decimal)
Ä -> U+00C4 -> 196 (code point, decimal)
ord() returns the code point as a plain integer, so no leading 00 is
omitted: there are no bytes involved at all until the string is encoded.
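
A quick check in Python 2.7 (a sketch; utf-16-be is used here only to
show where 00 bytes would actually appear once an encoding is chosen):

  # -*- coding: utf-8 -*-
  u = u'A\xc4'                      # u'AÄ'
  print [ord(c) for c in u]         # [65, 196] -- integers, not bytes
  print repr(u.encode('utf-8'))     # 'A\xc3\x84'     -- 3 bytes
  print repr(u.encode('latin-1'))   # 'A\xc4'         -- 2 bytes
  print repr(u.encode('utf-16-be')) # '\x00A\x00\xc4' -- 00 bytes appear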


> BTW: ASCII covers only the first 128 code points.

ACK
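
Which also explains why implicit conversions fail with the default
codec; a small Python 2.7 sketch:

  import sys
  print sys.getdefaultencoding()   # 'ascii'

  print u'A'.encode('ascii')       # fine: 65 < 128
  try:
      u'A\xc4'.encode('ascii')     # U+00C4 = 196 is outside ASCII
  except UnicodeEncodeError as e:
      print e                      # ... can't encode character u'\xc4'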
-- 
Kurt Mueller, kurt.alfred.muel...@gmail.com

