On Sat, 11 Dec 2021 12:24:10 -0800 Michael Forney <mfor...@mforney.org> wrote:
Dear Michael,

thanks for your input. You really know the intricacies much better
than I do.

> It is true that the existence of uint32_t implies that uint_least32_t
> also has exactly 32 bits and no padding bits, but they could still be
> distinct types. For instance, on a 32-bit platform with int and long
> both being exactly 32 bits, you could define uint32_t as one and
> uint_least32_t as the other. In that case, dereferencing an array of
> uint32_t as uint_least32_t would be undefined behavior.
>
> That said, I agree with this change. It also has the benefit of
> matching the definition of C11's char32_t.

That's a nice coincidence. The undefined behaviour would be okay for
me, given it would be a user error. In 99% of cases it will not be a
problem, and in any case it would not be libgrapheme's fault, as the
interfaces are specified clearly enough; but it's still good to know.

> > diff --git a/src/utf8.c b/src/utf8.c
> > index 4488359..1cb5e17 100644
> > --- a/src/utf8.c
> > +++ b/src/utf8.c
> > @@ -92,7 +101,7 @@ lg_utf8_decode(const uint8_t *s, size_t n, uint32_t *cp)
> >  	 * (i.e. between 0x80 (10000000) and 0xBF (10111111))
> >  	 */
> >  	for (i = 1; i <= off; i++) {
> > -		if(!BETWEEN(s[i], 0x80, 0xBF)) {
> > +		if(!BETWEEN((unsigned char)s[i], 0x80, 0xBF)) {
> >  			/*
> >  			 * byte does not match format; return
> >  			 * number of bytes processed excluding the
>
> Although irrelevant in C23, which will require 2's complement
> representation, I want to note the distinction between (unsigned
> char)s[i] and ((unsigned char *)s)[i]. The former adds 2^CHAR_BIT to
> negative values, while the latter interprets as a CHAR_BIT-bit
> unsigned integer (adds 2^CHAR_BIT if the sign bit is set). For
> example, if char had sign-magnitude representation, we'd have
> (unsigned char)"\x80"[0] == 0, but ((unsigned char *)"\x80")[0] ==
> 0x80.
>
> The latter is probably what you want, but you could ignore this if
> you only care about 2's complement (which is a completely reasonable
> position).

Okay, maybe I misunderstood something here, but from what I
understand, casting between signed and unsigned char is well-defined,
no matter the implementation. However, working bitwise is only
well-defined on an unsigned type (i.e. unsigned char in this case),
which is why I cast to unsigned char.

Where is the undefined behaviour here? Is it undefined behaviour to
cast between signed and unsigned char when the value is larger than
127? (I tried to capture the distinction you describe in a small
test, see the P.S. below.)

> > -		.arr = (uint8_t[]){ 0xFD },
> > +		.arr = (char[]){
> > +			(unsigned char)0xFD,
> > +		},
>
> This cast doesn't do anything here. Both 0xFD and (unsigned char)0xFD
> have the same value (0xFD), which can't necessarily be represented as
> char. For example if CHAR_MAX is 127, this conversion is
> implementation defined and could raise a signal (C99 6.3.1.3p2).
>
> I think using hex escapes in a string literal ("\xFD") has the
> behavior you want here. You could also create an array of unsigned
> char and cast to char *.

From how I understood the standard, it does make a difference: 0xFD
as written is an int literal, and the compiler prints a warning
stating that it cannot be converted to a (signed) char. However, it
does not complain with unsigned char, so I assumed that the standard
somehow safeguards it.

But if I got it correctly, you are saying that this only works
because I assume two's complement, right? So what's the portable way
to work with chars? :) (See the second sketch in the P.P.S. for how I
understood your two suggestions.)

With best regards

Laslo
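
P.S.: To check my understanding of the distinction you draw between
(unsigned char)s[i] and ((unsigned char *)s)[i], here is a small
standalone sketch (just a test program for this thread, not
libgrapheme code). On two's complement both lines print 0x80; the
difference would only surface on the exotic representations you
mention:

	#include <stdio.h>

	int
	main(void)
	{
		const char *s = "\x80";

		/*
		 * value conversion: convert the (possibly negative)
		 * char value to unsigned char; with sign-magnitude
		 * chars, "\x80"[0] would be negative zero, so this
		 * would yield 0
		 */
		unsigned char byval = (unsigned char)s[0];

		/*
		 * representation reinterpretation: read the same byte
		 * through an unsigned char lvalue; this yields the
		 * raw bit pattern 0x80 regardless of how char is
		 * represented
		 */
		unsigned char bybits = ((const unsigned char *)s)[0];

		printf("by value: 0x%02X, by bits: 0x%02X\n",
		       (unsigned int)byval, (unsigned int)bybits);

		return 0;
	}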
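
P.P.S.: And here is how I understood your two portable alternatives
for getting the byte 0xFD into a char array (again just a sketch, not
the actual test-data code):

	#include <stdio.h>

	int
	main(void)
	{
		/*
		 * hex escapes in a string literal store the raw bit
		 * pattern, independent of whether char is signed
		 */
		const char *a = "\xFD";

		/*
		 * build the data as unsigned char and cast the
		 * pointer where a char * is needed; accessing an
		 * object through a character type is always allowed
		 */
		static const unsigned char raw[] = { 0xFD };
		const char *b = (const char *)raw;

		/* both print 0xFD */
		printf("0x%02X 0x%02X\n",
		       (unsigned int)((const unsigned char *)a)[0],
		       (unsigned int)((const unsigned char *)b)[0]);

		return 0;
	}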