I believe the masking part is there because of the UTF-8 standard: https://tools.ietf.org/html/rfc3629#section-3 The first byte starts with, say n - 1, consecutive bits with value 1, and then a bit with value 0 to indicate the number of bytes to read. The remaining 8 - n bits in the first byte are then read into value. Although this doesn't really do anything in the case of n = 1 (which is when we mask with 0x7F), it is needed for all of the other cases, as you can see in the "if-else" branches, and consistency across the different cases also helps readability. I think the reader's question would be more easily answered if the reader knows what 0x7F means. 0x7F is equal to ~0x80, and likewise, in the other cases, 0x1f = ~0xe0, 0xf = ~0xf0, and so on. Seeing how there is a lot of repeated or predictable code, we could instead use macros, something like this: #define LEADING_BITS_SET(n) (((1 << n) - 1) << (8 - n)) #define BITS_AFTER(n) (~LEADING_BITS_SET(n)) ... if ((*p & LEADING_BITS_SET(n)) == LEADING_BITS_SET(n - 1)) { ... value = (*p++ & BITS_AFTER(n)) ... ... }
Alternatively, if we really want to optimize out the unnecessary masking for n = 1, we can have: #define BITS_AFTER(n) (~LEADING_BITS_SET(n - 1)) So BITS_AFTER(1) = 0xff, which I believe will get optimized out from the and instruction. In any case the first thing the reader will think is "something that's consistent with UTF-8," and the thought of "technically not necessary for the first case" may not even occur depending on how we define BITS_AFTER On Tue, Aug 27, 2013 at 2:13 AM, Michel <[email protected]> wrote: > Thanks for your comment, > but no, I didn't talk about performance. > I understand this is not very costly, especially compared with other > crypto operations. > My concern was mostly about keeping the source code easy to understand and > 'logically consistent'. > > I am trying to save the reader from asking himself "what is the use of the > '& 0x7F' masking operation ?". > > Le 25/08/2013 12:23, PMHager a écrit : > >> If your intention is performance optimization you could even replace >> >> if((*p & 0x80) == 0) >> with >> if((signed char)(*p) >= 0) >> >> as you cannot assume that all compilers will do it correctly themselves. >> >> -----Original Message----- >> From: [email protected] [mailto:owner-openssl-dev@** >> openssl.org <[email protected]>] On >> Behalf Of Michel >> Sent: Thursday, August 22, 2013 11:44 AM >> To: [email protected] >> Subject: UTF8 decoding, unneeded byte masking >> >> In a_utf8.c, lines 85 and 86 (1.0.1e) : >> >> ... >> if((*p & 0x80) == 0) { // as this byte looks like : 0xxxxxxx >> value = *p++ & 0x7f; // this line could as well be written : >> value = *p++; >> ... >> >> If I don't miss something, it would seems clearer to me. >> >> > > ______________________________**______________________________**__________ > OpenSSL Project http://www.openssl.org > Development Mailing List [email protected] > Automated List Manager [email protected] >
