Your very good explanation made me realise I was being shortsighted.
Now I understand and share your point of view.
Thanks to all for your interesting comments.
On 28/08/2013 06:09, Yuan Kang wrote:
I believe the masking part is there because of the UTF-8 standard:
https://tools.ietf.org/html/rfc3629#section-3
The first byte starts with some number, say n - 1, of consecutive 1 bits,
followed by a 0 bit; that count indicates the number of bytes to read.
The remaining 8 - n bits of the first byte are then read into value.
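As an illustration (my own sketch, not the code under review; the helper name is hypothetical, and it is written directly against the RFC 3629 bit patterns rather than the n convention above):

```c
/* Length of a UTF-8 sequence, determined from its leading byte
 * (bit patterns per RFC 3629).  Returns 0 for a byte that cannot
 * start a sequence. */
static int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII */
    if ((lead & 0xe0) == 0xc0) return 2; /* 110xxxxx        */
    if ((lead & 0xf0) == 0xe0) return 3; /* 1110xxxx        */
    if ((lead & 0xf8) == 0xf0) return 4; /* 11110xxx        */
    return 0; /* 10xxxxxx (a continuation byte) or invalid */
}
```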
Although this masking doesn't really do anything in the case of n = 1
(which is when we mask with 0x7F), it is needed for all of the other
cases, as you can see in the "if-else" branches, and consistency across
the different cases also helps readability. I think the reader's
question would be more easily answered if the reader knew what 0x7F
means: 0x7F is equal to ~0x80 (truncated to a byte), and likewise, in
the other cases, 0x1f = ~0xe0, 0xf = ~0xf0, and so on. Seeing how much
of the code is repeated or predictable, we could instead use macros,
something like this:
#define LEADING_BITS_SET(n) (((1u << (n)) - 1) << (8 - (n)))
#define BITS_AFTER(n) (~LEADING_BITS_SET(n))
...
if ((*p & LEADING_BITS_SET(n)) == LEADING_BITS_SET(n - 1)) {
...
value = (*p++ & BITS_AFTER(n)) ...
...
}
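For a quick sanity check, the values these macros produce line up with the constants in the original code (the & 0xff below is my addition, just so the complements compare as single-byte values):

```c
/* The proposed macros, with arguments fully parenthesized and the
 * results truncated to a byte for comparison purposes. */
#define LEADING_BITS_SET(n) ((((1u << (n)) - 1) << (8 - (n))) & 0xffu)
#define BITS_AFTER(n)       (~LEADING_BITS_SET(n) & 0xffu)

/* n | LEADING_BITS_SET(n) | BITS_AFTER(n)
 * 1 | 0x80                | 0x7f
 * 3 | 0xe0                | 0x1f
 * 4 | 0xf0                | 0x0f
 * 5 | 0xf8                | 0x07 */
```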
Alternatively, if we really want to optimize out the unnecessary
masking for n = 1, we can have:
#define BITS_AFTER(n) (~LEADING_BITS_SET((n) - 1))
So BITS_AFTER(1) is all ones (0xff as a byte mask), which I believe the
compiler will optimize away, since and-ing with all ones is a no-op.
In any case, the first thing the reader will think is "something that's
consistent with UTF-8," and the thought "technically not necessary for
the first case" may not even occur, depending on how we define
BITS_AFTER.
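For what it's worth, with the macros in place the if-else branches can collapse into a loop. This is only my sketch of the idea, not the actual patch; utf8_decode_lead is a made-up name:

```c
#define LEADING_BITS_SET(n) (((1u << (n)) - 1) << (8 - (n)))
#define BITS_AFTER(n) (~LEADING_BITS_SET(n))

/* Decode the leading byte of a UTF-8 sequence: store its payload
 * bits in *value and return the total sequence length in bytes,
 * or 0 if the byte cannot start a sequence. */
static int utf8_decode_lead(unsigned char lead, unsigned int *value)
{
    for (int n = 1; n <= 5; n++) {
        if (n == 2)
            continue; /* 10xxxxxx is a continuation byte, not a lead */
        /* n - 1 leading one bits followed by a zero bit */
        if ((lead & LEADING_BITS_SET(n)) == LEADING_BITS_SET(n - 1)) {
            *value = lead & BITS_AFTER(n);
            return n == 1 ? 1 : n - 1;
        }
    }
    return 0; /* continuation byte or invalid (0xf8-0xff) */
}
```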