Your very good explanation made me realise I was being shortsighted.
Now I understand and share your point of view.
Thanks to all for your interesting comments.
On 28/08/2013 06:09, Yuan Kang wrote:
I believe the masking part is there because of the UTF-8 standard:
https://tools.ietf.org/html/rfc3629#section-3
The first byte starts with some number, say n - 1, of consecutive 1 bits,
followed by a 0 bit; that count indicates the number of bytes to read.
The remaining 8 - n bits of the first byte are then read into value.
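As an illustration (my own sketch, not the code under review; the helper name is hypothetical, and it is written directly against the RFC 3629 bit patterns rather than the n convention above):

```c
/* Length of a UTF-8 sequence, determined from its leading byte
 * (bit patterns per RFC 3629).  Returns 0 for a byte that cannot
 * start a sequence. */
static int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII */
    if ((lead & 0xe0) == 0xc0) return 2; /* 110xxxxx        */
    if ((lead & 0xf0) == 0xe0) return 3; /* 1110xxxx        */
    if ((lead & 0xf8) == 0xf0) return 4; /* 11110xxx        */
    return 0; /* 10xxxxxx (a continuation byte) or invalid */
}
```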
Although this masking doesn't really do anything in the case of n = 1
(which is when we mask with 0x7F), it is needed for all of the other
cases, as you can see in the "if-else" branches, and consistency across
the different cases also helps readability. I think the reader's
question would be more easily answered if the reader knew what 0x7F
means: 0x7F is equal to ~0x80 (truncated to a byte), and likewise, in
the other cases, 0x1f = ~0xe0, 0xf = ~0xf0, and so on. Seeing how much
of the code is repeated or predictable, we could instead use macros,
something like this:
#define LEADING_BITS_SET(n) (((1u << (n)) - 1) << (8 - (n)))
#define BITS_AFTER(n) (~LEADING_BITS_SET(n))
...
if ((*p & LEADING_BITS_SET(n)) == LEADING_BITS_SET(n - 1)) {
...
value = (*p++ & BITS_AFTER(n)) ...
...
}
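For a quick sanity check, the values these macros produce line up with the constants in the original code (the & 0xff below is my addition, just so the complements compare as single-byte values):

```c
/* The proposed macros, with arguments fully parenthesized and the
 * results truncated to a byte for comparison purposes. */
#define LEADING_BITS_SET(n) ((((1u << (n)) - 1) << (8 - (n))) & 0xffu)
#define BITS_AFTER(n)       (~LEADING_BITS_SET(n) & 0xffu)

/* n | LEADING_BITS_SET(n) | BITS_AFTER(n)
 * 1 | 0x80                | 0x7f
 * 3 | 0xe0                | 0x1f
 * 4 | 0xf0                | 0x0f
 * 5 | 0xf8                | 0x07 */
```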
Alternatively, if we really want to optimize out the unnecessary
masking for n = 1, we can have:
#define BITS_AFTER(n) (~LEADING_BITS_SET((n) - 1))
So BITS_AFTER(1) is all ones (0xff as a byte mask), which I believe the
compiler will optimize away, since and-ing with all ones is a no-op.
In any case, the first thing the reader will think is "something that's
consistent with UTF-8," and the thought "technically not necessary for
the first case" may not even occur, depending on how we define
BITS_AFTER.
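For what it's worth, with the macros in place the if-else branches can collapse into a loop. This is only my sketch of the idea, not the actual patch; utf8_decode_lead is a made-up name:

```c
#define LEADING_BITS_SET(n) (((1u << (n)) - 1) << (8 - (n)))
#define BITS_AFTER(n) (~LEADING_BITS_SET(n))

/* Decode the leading byte of a UTF-8 sequence: store its payload
 * bits in *value and return the total sequence length in bytes,
 * or 0 if the byte cannot start a sequence. */
static int utf8_decode_lead(unsigned char lead, unsigned int *value)
{
    for (int n = 1; n <= 5; n++) {
        if (n == 2)
            continue; /* 10xxxxxx is a continuation byte, not a lead */
        /* n - 1 leading one bits followed by a zero bit */
        if ((lead & LEADING_BITS_SET(n)) == LEADING_BITS_SET(n - 1)) {
            *value = lead & BITS_AFTER(n);
            return n == 1 ? 1 : n - 1;
        }
    }
    return 0; /* continuation byte or invalid (0xf8-0xff) */
}
```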