> On May 31, 2016, at 11:33 AM, Connor Lane Smith <c...@lubutu.com> wrote:
>
>> On 31 May 2016 at 18:43, FRIGN <d...@frign.de> wrote:
>> as a quick note, the sbase libutf is probably the most feature-rich one.
>> The version by cls suffers from multiple issues, even though it might
>> be the most recent.
>
> Strictly speaking they're all by me, since I started it (and sbase) in
> the first place. But there we are.
>
>> I am currently working on a new libutf which is much simpler, much
>> more secure (de/encoder) and actually gets the grapheme handling right.
>
> One of the reasons I'm not pushing for any particular solution to the
> fragmentation problem is that I'm not sure what libutf should actually
> do. There are three components that are distinguishable in the Plan 9
> API, which are UTF-8 (runetochar, chartorune, utf*, etc), UTF-32
> (runestr*), and Unicode (is*rune, etc).
>
> The trouble is I don't think it's necessary for a single library to do
> all of these things. All UTF-8 is is an encoding of 31-bit integers,
> and UTF-32 is another encoding. The stuff specific to Unicode, which
> requires the latest Unicode database and all that, is really a
> separate issue -- as is the rejection of certain values, like
> surrogates or values over 0x10FFFF, both of which are only invalid
> because of the braindead UTF-16 encoding. And grapheme handling is
> another thing which has nothing actually to do with UTF.

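To make sure we are talking about the same split, here is a minimal
sketch of the layering as I understand it. It is my own illustration,
not code from any libutf version or from Plan 9: the names utf8_decode
and unicode_is_scalar are made up, and the decoder accepts the old
1-to-6-byte forms on the assumption that "31-bit integers" is meant
literally. It rejects only truncated, malformed and overlong input, and
leaves surrogates and the 0x10FFFF limit to a separate predicate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the libutf or Plan 9 API. */

/* utf8_min[n]: smallest value that genuinely needs an n-byte sequence. */
static const uint32_t utf8_min[7] = {
    0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
};

/*
 * Layer 1: UTF-8 purely as an encoding of 31-bit integers.
 * Decodes one value from s[0..len) into *out and returns the number of
 * bytes consumed, or 0 if the input is truncated, malformed or overlong.
 * Surrogates and values above 0x10FFFF are deliberately NOT rejected here.
 */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t n, i;
    uint32_t v;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *out = s[0];
        return 1;
    }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; v = s[0] & 0x07; }
    else if ((s[0] & 0xFC) == 0xF8) { n = 5; v = s[0] & 0x03; }
    else if ((s[0] & 0xFE) == 0xFC) { n = 6; v = s[0] & 0x01; }
    else
        return 0;               /* stray continuation byte, 0xFE or 0xFF */

    if (len < n)
        return 0;               /* truncated */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (v < utf8_min[n])
        return 0;               /* overlong encoding */
    *out = v;
    return n;
}

/*
 * Layer 2: Unicode policy, independent of the byte encoding.
 * Returns nonzero if v is a Unicode scalar value, i.e. at most 0x10FFFF
 * and not a UTF-16 surrogate.
 */
int unicode_is_scalar(uint32_t v)
{
    return v <= 0x10FFFF && !(v >= 0xD800 && v <= 0xDFFF);
}
```

With this split, the bytes ED A0 80 decode without complaint to 0xD800
and only unicode_is_scalar() objects to the result, whereas C0 80 is
rejected by the decoder itself as overlong.
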
I am pretty sure you are aware of this already, but the UTF-8 RFC
(RFC 3629) defines the Unicode quirks as part of UTF-8 itself. Even the
title is "UTF-8, a transformation format of ISO 10646". It does not
call it a general-purpose transformation format of 31-bit integers. I
haven't looked at other definitions, if they exist; maybe they say
something else.

But anyway, I am wondering why you feel pressure to generalize it
further. Is it more of a design aesthetic thing? I can see that.
Personally, I could see having separate functions, but I think they
should be packaged together, because anyone who really wants only the
general pieces can easily rip them out when needed. I think nearly
everyone who consumes the interface expects it all together, though.
If you want to be the one who makes it available in pieces for the sake
of availability, that is a valid choice. But you seem to be unsure of
what to do. Me? I put them together. I have made this exact decision
before, in fact.

I hope this helps you in some way. :)

> So in earlier versions of libutf I was vigilant in rejecting those
> values that Unicode say are invalid, but in my latest version on
> github I've started only rejecting overlong sequences, since the
> others are still (in my view) valid UTF-8 even if they aren't valid
> Unicode. Is this the right thing to do? I've not yet made up my mind.
> But my feeling is that the API for reading UTF-8 should be separate
> from that which deals with Unicode codepoints and graphemes that so
> happen to have been encoded in UTF-8. The two are essentially
> orthogonal, though are often conflated.
>
> Incidentally, I also changed my latest version to only ever need one
> byte of lookahead. For one thing, the Plan 9 version will say that a
> rune is not full even if it is, if it is malformed, which is fixed in
> my implementation. But another thing, which is only in my latest
> version, is that it always reads the fewest bytes needed to determine
> that the sequence is malformed. One benefit of this is that if you're
> reading with fgetc(), you can then ungetc() a byte that showed that
> the sequence was malformed (say, it was too short), and you are only
> guaranteed (by POSIX) to be able to ungetc() a single byte.
>
> That may not be relevant for sbase, of course, but I'm just saying
> there's a reason for the slight difference in complexity between the
> version in sbase and the latest version on my github.
>
> cls
>
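
For what it's worth, the one-byte-lookahead point can be sketched
roughly as follows. This is my own guess at the idea, not the code on
your github: fget_utf8, RUNE_EOF and RUNE_ERROR are hypothetical names,
and I am assuming the same "31-bit integers, reject only overlong"
policy you describe. The reader stops at the first byte that proves the
sequence bad; if that byte is not a continuation byte, it is pushed back
with a single ungetc() so the next call starts from it.

```c
#include <stdint.h>
#include <stdio.h>

#define RUNE_EOF   (-1L)  /* clean end of input */
#define RUNE_ERROR (-2L)  /* malformed, truncated or overlong sequence */

/*
 * Hypothetical helper, not part of any libutf: reads one UTF-8 encoded
 * value from fp, never consuming more than is needed to spot an error
 * and never pushing back more than one byte.
 */
long fget_utf8(FILE *fp)
{
    /*
     * min2[n]: smallest legal value of the lead-byte bits combined with
     * the first continuation byte of an n-byte sequence. Anything
     * smaller can only be an overlong form. (Two-byte sequences are
     * checked on the lead byte alone, so min2[2] is unused.)
     */
    static const uint32_t min2[7] = { 0, 0, 0, 0x20, 0x10, 0x08, 0x04 };
    uint32_t v;
    int c, n, i;

    if ((c = fgetc(fp)) == EOF)
        return RUNE_EOF;
    if (c < 0x80)
        return c;
    if (c < 0xC2)
        return RUNE_ERROR;  /* stray continuation byte, or C0/C1, which
                               can only start an overlong form */
    else if ((c & 0xE0) == 0xC0) { n = 2; v = (uint32_t)(c & 0x1F); }
    else if ((c & 0xF0) == 0xE0) { n = 3; v = (uint32_t)(c & 0x0F); }
    else if ((c & 0xF8) == 0xF0) { n = 4; v = (uint32_t)(c & 0x07); }
    else if ((c & 0xFC) == 0xF8) { n = 5; v = (uint32_t)(c & 0x03); }
    else if ((c & 0xFE) == 0xFC) { n = 6; v = (uint32_t)(c & 0x01); }
    else
        return RUNE_ERROR;  /* 0xFE or 0xFF */

    for (i = 1; i < n; i++) {
        if ((c = fgetc(fp)) == EOF)
            return RUNE_ERROR;      /* sequence cut short by end of input */
        if ((c & 0xC0) != 0x80) {
            ungetc(c, fp);          /* not ours: hand it back, one byte only */
            return RUNE_ERROR;
        }
        v = (v << 6) | (uint32_t)(c & 0x3F);
        if (i == 1 && v < min2[n])
            return RUNE_ERROR;      /* overlong, already known after two bytes */
    }
    return (long)v;
}
```

Given the input E2 82 'A', the first call reports an error and the next
call returns 'A', because the 'A' that exposed the short sequence was
handed back rather than swallowed.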