Hi,

On Saturday, 27.03.2010, at 18:04 -0400, Behdad Esfahbod wrote:
> Sure, I wasn't referring to valid data.  In valid UTF-8, there is no 5byte or
> 6byte sequences either.

True, but that restriction was imposed after the fact, when Unicode was
redefined as a 21-bit character set, presumably to suit the range
representable by UTF-16.  The 21-bit version of UCS-4 got the new name
UTF-32, but UTF-8 kept its name despite the changed definition.

At the time the UTF-8 decoding routines in GLib and glibmm were written,
a UTF-8 sequence could still be up to six bytes long and encode a full
31-bit UCS-4 code point.

I don't think it's inconceivable that the 21-bit restriction may some day
be lifted again.  After all, Unicode started out as a 16-bit encoding and
was later extended beyond that range.  But even if that does not happen,
no-longer-valid UTF-8 sequences of five or six bytes can be interpreted
as UCS-4 code points in an unambiguous and obvious manner.  And since the
algorithm just happens to produce that result, I see no need to
artificially constrain it so that it returns something else instead.

However, for other invalid conditions to result in defined behavior,
explicit checks would be required in the code.  I see no reason to pay
the cost of validation checks that are insufficient anyway, given that
the documentation explicitly states that the behavior is undefined if the
input is not valid UTF-8.  It might be a different matter if it wrote
past the end of a buffer or something, but that's not the case here.

Interestingly, g_utf8_get_char() is the only place where the UTF8_GET()
macro is used.  I guess this wasn't always the case, and that some other
piece of code may have relied upon its half-checking behavior in the
past.

Cheers,
--Daniel
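
To illustrate the point about five- and six-byte sequences, here is a
minimal sketch of a decoder for the original six-byte UTF-8 scheme.  It
is not the UTF8_GET() macro from GLib; the function name
decode_utf8_31bit() and its signature are hypothetical, chosen only for
this example.  Like the behavior discussed above, it assumes well-formed
input and does not validate continuation bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of GLib or glibmm.
 * Decodes one sequence of the original UTF-8 scheme, in which the lead
 * byte selects a length of one to six bytes and the result may occupy
 * up to 31 bits.  The input is assumed to be well-formed: continuation
 * bytes are not checked, and a stray continuation byte in lead position
 * is simply treated as a two-byte lead. */
static uint32_t decode_utf8_31bit(const unsigned char *p, size_t *len_out)
{
    uint32_t c = p[0];
    size_t len;

    /* The lead byte determines both the sequence length and how many
     * of its low bits carry payload. */
    if (c < 0x80)      { len = 1; }
    else if (c < 0xe0) { len = 2; c &= 0x1f; }
    else if (c < 0xf0) { len = 3; c &= 0x0f; }
    else if (c < 0xf8) { len = 4; c &= 0x07; }
    else if (c < 0xfc) { len = 5; c &= 0x03; }
    else               { len = 6; c &= 0x01; }

    /* Each continuation byte contributes its low six bits. */
    for (size_t i = 1; i < len; i++)
        c = (c << 6) | (p[i] & 0x3f);

    if (len_out)
        *len_out = len;
    return c;
}
```

With this sketch, the five- and six-byte cases need no special handling:
they are just two more branches on the lead byte, and the shift loop
yields the code point without any extra work.  Rejecting them would take
an additional comparison against 0x10FFFF.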