Your very good explanation made me realise I was being shortsighted.
I now understand and share your point of view.

Thanks, all, for your interesting comments.

On 28/08/2013 06:09, Yuan Kang wrote:
I believe the masking part is there because of the UTF-8 standard: https://tools.ietf.org/html/rfc3629#section-3

The first byte starts with, say, n - 1 consecutive 1 bits, followed by a 0 bit, to indicate the number of bytes to read. The remaining 8 - n bits of the first byte are then read into value. Although the mask doesn't really do anything in the case of n = 1 (which is when we mask with 0x7F), it is needed in all of the other cases, as you can see in the "if-else" branches, and consistency across the different cases also helps readability.

I think the reader's question would be more easily answered if the reader knows what 0x7F means: 0x7F is equal to ~0x80, and likewise, in the other cases, 0x1f = ~0xe0, 0xf = ~0xf0, and so on. Seeing how there is a lot of repeated or predictable code, we could instead use macros, something like this:
/* A byte whose top n bits are set: LEADING_BITS_SET(3) == 0xe0. */
#define LEADING_BITS_SET(n) (((1 << (n)) - 1) << (8 - (n)))
/* The complementary low-order mask: BITS_AFTER(3) == 0x1f. */
#define BITS_AFTER(n) (~LEADING_BITS_SET(n) & 0xff)
...
/* Lead byte of this length: n - 1 ones followed by a 0 bit. */
if ((*p & LEADING_BITS_SET(n)) == LEADING_BITS_SET(n - 1)) {
  ...
  value = (*p++ & BITS_AFTER(n)) ...
  ...
}
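
For concreteness, here is a minimal self-contained sketch of how the macros could drive a full decode step. The function name decode_one and the test in main are made up for the example; continuation bytes are not validated, and only the four lengths RFC 3629 allows are handled:

#include <stdio.h>

#define LEADING_BITS_SET(n) (((1 << (n)) - 1) << (8 - (n)))
#define BITS_AFTER(n) (~LEADING_BITS_SET(n) & 0xff)

/* Decode one code point starting at p; return it, or -1 on a
   malformed lead byte. */
static long decode_one(const unsigned char *p)
{
    long value;
    int n, extra;

    if ((*p & LEADING_BITS_SET(1)) == LEADING_BITS_SET(0)) {        /* 0xxxxxxx */
        return *p & BITS_AFTER(1);
    } else if ((*p & LEADING_BITS_SET(3)) == LEADING_BITS_SET(2)) { /* 110xxxxx */
        n = 3; extra = 1;
    } else if ((*p & LEADING_BITS_SET(4)) == LEADING_BITS_SET(3)) { /* 1110xxxx */
        n = 4; extra = 2;
    } else if ((*p & LEADING_BITS_SET(5)) == LEADING_BITS_SET(4)) { /* 11110xxx */
        n = 5; extra = 3;
    } else {
        return -1;
    }
    value = *p++ & BITS_AFTER(n);
    while (extra-- > 0)
        value = (value << 6) | (*p++ & BITS_AFTER(2)); /* 10xxxxxx */
    return value;
}

int main(void)
{
    /* 0xE2 0x82 0xAC is the euro sign; this prints U+20AC. */
    printf("U+%04lX\n", decode_one((const unsigned char *)"\xE2\x82\xAC"));
    return 0;
}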

Alternatively, if we really want to optimize away the unnecessary masking for n = 1, we can instead define:
#define BITS_AFTER(n) (~LEADING_BITS_SET((n) - 1) & 0xff)

Then BITS_AFTER(1) == 0xff, and I believe a mask by 0xff will be optimized away, so no "and" instruction is emitted at all. For n > 1 the mask is one bit wider than before, but that is harmless: the "if" condition has already verified that the bit right after the leading ones is 0.
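
To make that concrete, here is a quick sanity check (a sketch assuming C11's _Static_assert; the values follow from the definitions above):

#define LEADING_BITS_SET(n) (((1 << (n)) - 1) << (8 - (n)))
#define BITS_AFTER(n) (~LEADING_BITS_SET((n) - 1) & 0xff)

/* n = 1 (ASCII): the mask is a no-op and should compile away. */
_Static_assert(BITS_AFTER(1) == 0xff, "0 leading bits masked");
/* n = 3 (2-byte lead, 110xxxxx): 0x3f instead of 0x1f; the extra
   bit is the terminating 0 already checked by the if condition. */
_Static_assert(BITS_AFTER(3) == 0x3f, "2 leading bits masked");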

In any case, the first thing the reader will think is "something that's consistent with UTF-8," and the thought "technically, the mask is not necessary in the first case" may not even occur, depending on how we define BITS_AFTER.
