On 13/6/2013 10:11 AM, Steven D'Aprano wrote:

  >>> chr(16474)
'䁚'

Some Chinese symbol.
So code-point '䁚' has a Unicode ordinal value of 16474, correct?

Correct.


and encoding this glyph's ordinal value to binary gives us
the following bytes:

  >>> bin(16474).encode('utf-8')
b'0b100000001011010'

One observation here, which I would ask you to confirm as valid:

1. A code point and its ordinal value are associated within the Unicode character set. They have a so-called 1:1 mapping.

So I was under the impression that encoding the code point into UTF-8 was the same as encoding the code point's ordinal value into UTF-8.

That is why I tried:
bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8')

So now I believe they are two different things.
The code point *is what actually* needs to be encoded, and *not* its ordinal value.
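
For comparison, here is a minimal sketch of the two calls side by side, assuming Python 3. The variable names are only illustrative and do not come from the thread:

```python
# Illustrative names; not taken from the original thread.
code_point = 16474

# Encode the character that the code point stands for.
char_bytes = chr(code_point).encode('utf-8')

# Encode the *text* produced by bin(), i.e. the characters '0', 'b', '1', ...
text_bytes = bin(code_point).encode('utf-8')

print(char_bytes)   # b'\xe4\x81\x9a'
print(text_bytes)   # b'0b100000001011010'

# Only the first result decodes back to the original character.
print(char_bytes.decode('utf-8') == chr(code_point))   # True
```

The first call yields the three UTF-8 bytes of the character, while the second yields one byte per character of the text '0b100000001011010'.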


The leading 0b is just syntax to tell you "this is base 2, not base 8
(0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.

But bytes objects are displayed using '\x' instead of the aforementioned '0x'. Why is that?
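
A small sketch that may help with this question, assuming Python 3: '\x' is an escape sequence used inside string and bytes literals to write a single byte by its hex value, whereas '0x' is the prefix of an integer literal. The name `data` is only illustrative:

```python
# Illustrative name; not taken from the original thread.
data = chr(16474).encode('utf-8')

# repr() of a bytes object shows non-printable bytes as \xNN escapes.
print(data)                     # b'\xe4\x81\x9a'

# Indexing or iterating a bytes object gives plain integers.
print(list(data))               # [228, 129, 154]

# hex() turns each integer into text with the 0x prefix.
print([hex(b) for b in data])   # ['0xe4', '0x81', '0x9a']

# The integer literal 0xe4 and the byte written as \xe4 have the same value.
print(0xe4 == data[0])          # True
```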


> No! That creates a string from 16474 in base two:
> '0b100000001011010'

I disagree here.
16474 is a number in base 10. With bin(16474) we get the binary representation of the number 16474, not a string.
Why do you say we receive a string when Python presents a binary number?


Then you encode the string '0b100000001011010' into UTF-8. There are 17
characters in this string, and they are all ASCII characters so they take
up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).
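
The character and byte counts described above can be checked with a short sketch, assuming Python 3 (the names `text` and `encoded` are only illustrative):

```python
# Illustrative names; not taken from the original thread.
text = bin(16474)
encoded = text.encode('utf-8')

print(len(text))      # 17 characters
print(len(encoded))   # 17 bytes, one per ASCII character

# For pure-ASCII text, the UTF-8 and ASCII encodings are identical.
print(encoded == text.encode('ascii'))   # True
```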

To me, 0b100000001011010 stands for a number in base 2, not a string.
Have I misunderstood something?
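
One way to check this in the interpreter is to compare the types of the two things. This is a minimal sketch assuming Python 3, with illustrative names:

```python
# Illustrative names; not taken from the original thread.
as_text = bin(16474)             # result of calling bin()
as_number = 0b100000001011010    # binary integer literal typed into source code

print(type(as_text))     # <class 'str'>
print(type(as_number))   # <class 'int'>
print(as_number)         # 16474

# The text can be converted back to the number by parsing it in base 2.
print(int(as_text, 2) == as_number)   # True
```

Typed without quotes, 0b100000001011010 is an integer literal. The value returned by bin() is text that merely looks like that literal, which is why the interpreter displays it in quotes.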


--
http://mail.python.org/mailman/listinfo/python-list
