Re: UTF8 decoding, unneeded byte masking

Yuan Kang Tue, 27 Aug 2013 21:11:38 -0700

I believe the masking part is there because of the UTF-8 standard:
https://tools.ietf.org/html/rfc3629#section-3
The first byte starts with, say n - 1, consecutive bits with value 1, and
then a bit with value 0 to indicate the number of bytes to read. The
remaining 8 - n bits in the first byte are then read into value. Although
this doesn't really do anything in the case of n = 1 (which is when we mask
with 0x7F), it is needed for all of the other cases, as you can see in the
"if-else" branches, and consistency across the different cases also helps
readability. I think the reader's question would be more easily answered if
the reader knows what 0x7F means. 0x7F is equal to ~0x80, and likewise, in
the other cases, 0x1f = ~0xe0, 0xf = ~0xf0, and so on. Seeing how there is
a lot of repeated or predictable code, we could instead use macros,
something like this:
#define LEADING_BITS_SET(n) (((1 << n) - 1) << (8 - n))
#define BITS_AFTER(n) (~LEADING_BITS_SET(n))
...
if ((*p & LEADING_BITS_SET(n)) == LEADING_BITS_SET(n - 1)) {
  ...
  value = (*p++ & BITS_AFTER(n)) ...
  ...
}


Alternatively, if we really want to optimize out the unnecessary masking
for n = 1, we can have:
#define BITS_AFTER(n) (~LEADING_BITS_SET(n - 1))

So BITS_AFTER(1) = 0xff, which I believe will get optimized out from the
and instruction.

In any case the first thing the reader will think is "something that's
consistent with UTF-8," and the thought of "technically not necessary for
the first case" may not even occur depending on how we define BITS_AFTER



On Tue, Aug 27, 2013 at 2:13 AM, Michel <msa...@paybox.com> wrote:

> Thanks for your comment,
> but no, I didn't talk about performance.
> I understand this is not very costly, especially compared with other
> crypto operations.
> My concern was mostly about keeping the source code easy to understand and
> 'logically consistent'.
>
> I am trying to save the reader from asking himself "what is the use of the
> '& 0x7F' masking operation ?".
>
> Le 25/08/2013 12:23, PMHager a écrit :
>
>> If your intention is performance optimization you could even replace
>>
>>    if((*p & 0x80) == 0)
>> with
>>    if((signed char)(*p) >= 0)
>>
>> as you cannot assume that all compilers will do it correctly themselves.
>>
>> -----Original Message-----
>> From: owner-openssl-...@openssl.org [mailto:owner-openssl-dev@**
>> openssl.org <owner-openssl-...@openssl.org>] On
>> Behalf Of Michel
>> Sent: Thursday, August 22, 2013 11:44 AM
>> To: openssl-dev@openssl.org
>> Subject: UTF8 decoding, unneeded byte masking
>>
>> In a_utf8.c, lines 85 and 86 (1.0.1e) :
>>
>> ...
>>    if((*p & 0x80) == 0) {     //  as this byte looks like : 0xxxxxxx
>>        value = *p++ & 0x7f;   //  this line could as well be written :
>> value = *p++;
>> ...
>>
>> If I don't  miss something, it would seems clearer to me.
>>
>>
>
> ______________________________**______________________________**__________
> OpenSSL Project                                 http://www.openssl.org
> Development Mailing List                       openssl-dev@openssl.org
> Automated List Manager                           majord...@openssl.org
>

Re: UTF8 decoding, unneeded byte masking

Reply via email to