On 1 June 2016 at 18:43, Kamil Cholewiński <harry6...@gmail.com> wrote:

> The 95% use case here is handling UTF8-encoded Unicode text. Secure by
> default should be the norm, not a magic flag, not buried in a readme.
Obviously nobody is arguing for magic flags or burying things in a readme.

> If you need to encode an arbitrarily large integer into a stream of
> bytes, then use a library specifically designed for encoding arbitrarily
> large integers into streams of bytes.

Or anything about arbitrarily large integers.

> Yes, we're making up problems.

I think you missed the point. The problem is not about what needs to be done, but about who needs to do what. You're saying that libutf, a UTF-8 library, should do Unicode validation. I am saying that Unicode validation is up to a Unicode library, and that a UTF-8 library should do nothing but parse UTF-8.

If a UTF-8 stream is invalid, there are two possible sources of the fault. One is that it contains 0xFE or 0xFF, or an overlong sequence, in which case it is the UTF-8 itself that is at fault. The other is that it encodes an invalid Unicode character, in which case it is not the UTF-8 that is at fault, but rather the Unicode -- and whether that is true depends on the current Unicode standard.

So what we're talking about is what a UTF-8 library should do: should it validate Unicode, or just UTF-8? The distinction matters because if we have a dedicated interface for Unicode concerns, like surrogates or graphemes, then only that interface needs to track the latest Unicode standard, as Ben explained, whilst the interfaces for handling UTF-8 or UTF-32 alone can remain fixed and unchanging. This encourages a separation of concerns, which makes bugs less likely (as Unicode is a moving target), and also reduces bit rot (a UTF-8 library will not stop being valid when Unicode changes).

As I said, the fundamental question is what libutf should actually do. Is it a UTF-8 library, or a Unicode library? It may be both, but then there is a strong argument that it should also support everything else Unicode requires, which is an awful lot.
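To make the two layers concrete, here is a rough sketch -- not libutf's actual API; the names decode_utf8 and valid_unicode are made up for illustration -- of how a pure UTF-8 decoder can catch encoding-level faults (stray continuation bytes, 0xFE/0xFF, overlong forms) while a separate, Unicode-aware check handles the rules that track the standard (surrogates, the 0x10FFFF ceiling):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical return codes, for illustration only. */
enum { DEC_OK, DEC_BAD_UTF8 };

/*
 * Layer 1: structural UTF-8 decoding. Rejects only encoding-level
 * faults: stray continuation bytes, 0xFE/0xFF lead bytes, truncated
 * and overlong sequences. It knows nothing about Unicode, so it
 * happily decodes surrogates and values above 0x10FFFF.
 * (Capped at four bytes here for simplicity.)
 */
static int
decode_utf8(const uint8_t *s, size_t n, uint32_t *cp, size_t *len)
{
	static const uint32_t minval[] = { 0, 0x80, 0x800, 0x10000 };
	size_t i, extra;
	uint32_t c;

	if (n == 0)
		return DEC_BAD_UTF8;
	c = s[0];
	if (c < 0x80)
		extra = 0;
	else if ((c & 0xE0) == 0xC0) { extra = 1; c &= 0x1F; }
	else if ((c & 0xF0) == 0xE0) { extra = 2; c &= 0x0F; }
	else if ((c & 0xF8) == 0xF0) { extra = 3; c &= 0x07; }
	else
		return DEC_BAD_UTF8;	/* continuation byte, 0xFE, 0xFF, ... */
	if (n < extra + 1)
		return DEC_BAD_UTF8;	/* truncated sequence */
	for (i = 1; i <= extra; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return DEC_BAD_UTF8;
		c = (c << 6) | (s[i] & 0x3F);
	}
	if (c < minval[extra])
		return DEC_BAD_UTF8;	/* overlong encoding */
	*cp = c;
	*len = extra + 1;
	return DEC_OK;
}

/*
 * Layer 2: Unicode validation, kept separate. This is the only part
 * that would ever need to track the Unicode standard.
 */
static int
valid_unicode(uint32_t cp)
{
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* surrogate */
	if (cp > 0x10FFFF)
		return 0;	/* beyond the Unicode code space */
	return 1;
}
```

The point of the split is that decode_utf8 never needs to change, while valid_unicode (and whatever grows up around it) is the moving target.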
My recent inclination has been towards supporting only the raw encoding, with a higher-level interface handling Unicode validation as well as everything else specific to the current version of Unicode. That way we could tackle UTF-8 without having to bother with all of the other craziness. It would be a fixed, static library, with no need to track the standard.

Now, if someone wants to deal with Unicode then they would use a Unicode library, not just a UTF-8 library, since all the latter does is encode and decode. All of the stuff that is specific to a Unicode standard, and which holds no matter which encoding we use -- UTF-8, UTF-32, UTF-1, UTF-7, etc. -- can be put in a separate library. Thus Unicode validation lives in one place, rather than being distributed amongst several interfaces which may be updated at different rates, use outdated Unicode tables, and so on. Hence, a separation of concerns.

So the question is whether libutf is meant to deal only with UTF-8 (which is constant), or with other Unicode features too (which are dynamic). The arguments on either side are essentially: stability, or convenience? As I say, I've not yet made up my mind, but I don't think the problem is made up. Maybe we'll end up placing libutf somewhere between the two, rejecting surrogates and values over 0x10FFFF but stopping short of supporting the character classes of this specific Unicode version (which has changed several times already since I first wrote libutf). At the very least, I think the discussion is worth having.

cls