On 14/6/2013 9:00 AM, Zero Piraeus wrote:
:

On 14 June 2013 01:34, Nick the Gr33k <supp...@superhost.gr> wrote:
Why doesn't it work like this?

leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag

Wouldn't it be more logical?

Think about it. Let's say that, as per your scheme, a leading 0
indicates "1 byte" (as is indeed the case in UTF8). What things could
follow that leading 0? How does that impact your choice of a leading
00 or 01 for other numbers of bytes?

... okay, you're obviously going to need to be spoon-fed a little more
than that. Here's a byte:

   01010101

Is that a single byte representing a code point in the 0-127 range, or
the first of 4 bytes representing something else, in your proposed
scheme? How can you tell?

Indeed.

You cannot tell if it stands for 1 byte or a 4 byte sequence:

0 + 1010101 = leading 0 stands for a 1-byte representation of a code point

01 + 010101 = leading 01 stands for a 4-byte representation of a code point

The problem with my scheme is that you cannot tell whether the flag is '0' or '01'.

The same happens with leading '1' and '11': you cannot tell what the flag is, so you cannot know whether the Unicode code point is represented as a 2-byte sequence or a 6-byte sequence.

Understood
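
The ambiguity can be checked mechanically. This is only a rough sketch of the proposed scheme (not real UTF-8); the names PROPOSED_FLAGS and possible_lengths are made up for illustration. It lists every sequence length whose flag matches the start of a byte:

```python
# Flags from the scheme proposed above (hypothetical, NOT real UTF-8).
# Maps each leading-bit flag to the sequence length it is meant to signal.
PROPOSED_FLAGS = {"0": 1, "1": 2, "00": 3, "01": 4, "10": 5, "11": 6}

def possible_lengths(byte_bits, flags=PROPOSED_FLAGS):
    """Return every sequence length whose flag is a prefix of byte_bits."""
    return sorted(n for flag, n in flags.items() if byte_bits.startswith(flag))

print(possible_lengths("01010101"))  # [1, 4]
print(possible_lengths("11000000"))  # [2, 6]
```

Every byte matches two flags at once, because each one-bit flag is a prefix of two of the two-bit flags.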


Now look at the way UTF8 does it:
<http://en.wikipedia.org/wiki/Utf-8#Description>

Really, follow the link and study the table carefully. Don't continue
reading this until you believe you understand the choices that the
designers of UTF8 made, and why they made them.

Pay particular attention to the possible values for byte 1. Do you
notice the difference between that scheme, and yours:

   0xxxxxxx
   1xxxxxxx
   00xxxxxx
   01xxxxxx
   10xxxxxx
   11xxxxxx

If you don't see it, keep looking until you do ... this email gives
you more than enough hints to work it out. Don't ask someone here to
explain it to you. If you want to become competent, you must use your
brain.

0xxxxxxx
110xxxxx        10xxxxxx
1110xxxx        10xxxxxx        10xxxxxx
11110xxx        10xxxxxx        10xxxxxx        10xxxxxx
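
To see these patterns on real data, here is a small sketch, assuming Python 3 (where str.encode returns bytes). The helper name utf8_bits is made up for illustration. It prints the bits of a few encoded characters, one from each row of the table:

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of ch as space-separated 8-bit groups."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))

# One character per row of the table: U+0041, U+00E9, U+20AC, U+1F600.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%04X" % ord(ch), utf8_bits(ch))

# U+0041 01000001
# U+00E9 11000011 10101001
# U+20AC 11100010 10000010 10101100
# U+1F600 11110000 10011111 10011000 10000000
```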

I did read the link, but I still cannot see:

1. why '110' is the flag for a 2-byte code point
2. why the leading flag in the 2nd byte and every subsequent byte has to be '10'
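
One way to approach both questions is to note that none of the prefixes 0, 10, 110, 1110 and 11110 is the start of another, so any single byte falls into exactly one class. A flag of '11' for 2-byte sequences would be a prefix of '1110' and '11110', which is the same ambiguity as before. With '10' reserved for continuation bytes, a reader that lands in the middle of a stream can also tell that it is not at the start of a character. The sketch below assumes Python 3 (iterating over bytes yields integers); the function name classify and its labels are made up for illustration:

```python
def classify(byte):
    """Classify one byte value (0-255) by its leading bits, per the table above."""
    if byte >> 7 == 0b0:
        return "single"        # 0xxxxxxx
    if byte >> 6 == 0b10:
        return "continuation"  # 10xxxxxx
    if byte >> 5 == 0b110:
        return "lead-2"        # 110xxxxx
    if byte >> 4 == 0b1110:
        return "lead-3"        # 1110xxxx
    if byte >> 3 == 0b11110:
        return "lead-4"        # 11110xxx
    return "invalid"           # 11111xxx is not in the table above

# 'a' followed by U+20AC, which encodes to three bytes.
print([classify(b) for b in "a\u20ac".encode("utf-8")])
# ['single', 'lead-3', 'continuation', 'continuation']
```

The order of the checks does not matter here, because the prefixes are mutually exclusive.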

--
What is now proved was at first only imagined!
