Hi all,

Following this past conversation, I decided to reinstate rune validity
checks in libutf. Since people seem to be using my repo as a submodule,
I decided it was best to cater for that (somewhat questionable) use case.

> I would have liked to have separated UTF-8 and Unicode support into two
> separate libraries. Unicode has changed the definitions of valid and
> invalid codepoints a number of times, whilst UTF-8 has remained as it
> is, unchanging. Likewise, the current version of Unicode ought not be
> necessary only to parse UTF-8 sequences. However, it is clear that it is
> expected that libutf will do this, and I think adding another library as
> a dependency would undermine the appeal of a minimalist UTF-8 library.
>
> It's not a very happy situation though, since attempting to catch all
> possible sources of invalid runes, rather than only those that are truly
> malformed UTF-8, would require much more code if it were to detect them
> at the earliest possible opportunity, as is done with things like
> overlong encodings. So my solution has been to treat those as a separate
> class of error, and to detect validity of the rune, as opposed to the
> UTF-8 sequence, as a matter of postprocessing.
>
> As I say, this isn't a happy situation, but I think this is the best
> compromise between those mortal enemies, pragmatism and idealism.

So, to reiterate the above, I've separated out that check, so there are
two distinct classes of error: UTF-8 errors, and Unicode errors. The
former, which are malformed UTF-8 sequences, are detected at the
earliest instant, whilst the latter, which are just invalid according to
the Unicode consortium, are detected only after the rune value has been
unpacked. I think that's the best compromise, such as it is.

Thanks,
cls
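
P.S. For anyone who wants the two error classes spelled out, here is a
rough sketch in C. It is not libutf's actual code: the sk_ names and the
error codes are made up for illustration, it only handles sequences of
up to four bytes, and the set of "invalid" runes (surrogates and
anything above U+10FFFF) is an assumption about what the Unicode check
covers. For brevity it also tests for overlong encodings only once the
value is unpacked, though still inside the UTF-8 stage.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes, one per error class. */
enum { SK_OK = 0, SK_BAD_UTF8 = -1, SK_BAD_RUNE = -2 };

/*
 * Stage 1: unpack one UTF-8 sequence from s[0..len).
 * Returns the number of bytes consumed, or 0 if the bytes are
 * malformed UTF-8 (bad lead byte, truncation, bad continuation
 * byte, or overlong encoding).
 */
static size_t sk_unpack(const unsigned char *s, size_t len, uint32_t *out)
{
    /* Smallest value that legitimately needs n bytes. */
    static const uint32_t min[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t n, i;
    uint32_t r;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2; r = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3; r = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4; r = s[0] & 0x07;
    } else {
        return 0;   /* stray continuation byte, or a lead this sketch ignores */
    }
    if (len < n)
        return 0;   /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;   /* not a continuation byte */
        r = (r << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (r < min[n])
        return 0;   /* overlong encoding */
    *out = r;
    return n;
}

/* Stage 2: is the unpacked value acceptable as a rune? (assumed rule set) */
static int sk_rune_valid(uint32_t r)
{
    if (r > 0x10FFFF)
        return 0;   /* beyond the Unicode range */
    if (r >= 0xD800 && r <= 0xDFFF)
        return 0;   /* UTF-16 surrogate */
    return 1;
}

/* Run both stages and report which class of error, if any, occurred. */
static int sk_decode(const unsigned char *s, size_t len,
                     uint32_t *out, size_t *used)
{
    size_t n = sk_unpack(s, len, out);

    if (n == 0)
        return SK_BAD_UTF8;
    *used = n;
    return sk_rune_valid(*out) ? SK_OK : SK_BAD_RUNE;
}
```

With this split, the bytes C0 80 fail in the first stage as an overlong
encoding, whereas ED A0 80 unpacks cleanly to U+D800 and is only
rejected by the second stage.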