Hi,

On 04/15/2013 09:14 PM, Bill Spitzak wrote:
> Jan Arne Petersen wrote:
>
>> * Changes offsets to be Unicode character instead of byte based
>
> No, PLEASE DON'T DO THIS!!!
>
> You think you are making things "easier" but you are making it much much
> harder.
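
To make the two kinds of offset concrete, here is a minimal sketch in
Python. The helper names are made up for illustration and are not part
of the protocol or of any toolkit, and the code assumes the text is
valid UTF-8. For invalid input the code point count depends on which
counting rule a decoder picks.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def byte_to_code_point_offset(data: bytes, byte_offset: int) -> int:
    """Count the code points that start before byte_offset.

    Assumption: data is valid UTF-8. Every byte that is not a
    continuation byte starts a new code point.
    """
    return sum(1 for b in data[:byte_offset] if not is_continuation(b))


def on_code_point_boundary(data: bytes, byte_offset: int) -> bool:
    """True if byte_offset does not split a multi-byte sequence.

    Assumption: data is valid UTF-8. Offsets outside the buffer are
    reported as not being on a boundary.
    """
    if byte_offset < 0 or byte_offset > len(data):
        return False
    if byte_offset == len(data):
        return True
    return not is_continuation(data[byte_offset])


# "naive" with a diaeresis on the i: 5 code points, 6 bytes in UTF-8.
text = "na\u00efve".encode("utf-8")

# The cursor position after the third code point is byte offset 4.
cursor_bytes = 4
cursor_code_points = byte_to_code_point_offset(text, cursor_bytes)

# Byte offset 3 lands between the two bytes of U+00EF.
split_offset_ok = on_code_point_boundary(text, 3)
```

Here the same cursor position is 4 as a byte offset and 3 as a code
point offset, and byte offset 3 does not fall on a code point boundary.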

My main reason was that EFL, IBus and partly GTK+ were using Unicode
characters as offsets and I did not want to have to specify how to
handle 'invalid' byte offsets.

> You may not believe it, but "how many characters are in this
> UTF-8" will generate dozens of different answers and should never be
> used as part of a communication api.

"Unicode characters" is indeed not good enough for a protocol
specification. I should have written "Unicode code points" instead. But
even with that we still have the problem with invalid byte sequences.

So I do not really mind using byte offsets. But we still need to think
about how to handle invalid byte sequences anyways. What do we expect a
toolkit to do when text with invalid byte sequences is inserted with
commit_string? How to handle delete_surrounding_text with the byte
offsets not matching code points? Should the toolkit ignore such
requests or should we leave that as undefined behavior?

> 1. A lot of things really count UTF-16 code units, not Unicode code
> points, due to being designed for Windows.
>
> 2. Handling of invalid byte sequences. Some consider one byte a
> character, some consider up to 4 bytes stopping at the first byte that
> fails the UTF-8 parsing, some consider all trailing bytes no matter how
> long, some consider the N bytes determined by the lead byte no matter
> what they are (the first is the most common and the first two are the
> only ones recommended, but the others exist, sometimes multiple rules in
> the same decoder!). And don't you dare spout the nonsense that somehow
> invalid byte sequences won't happen, or that if they are there it is
> "not UTF-8" and thus somehow saying this means it will magically not
> ever go through the API.
>
> 3. Disagreement about whether the encoding of UTF-16 surrogate halves,
> the characters 0xNNFFFE and 0xNNFFFF, the C0 and C1 control characters,
> code points greater than 0x10FFFF, etc, are "characters" or "errors". If
> errors many decoders count them as 3 or 4 characters rather than one.
>
> 4. How to count combining characters.
>
> 5. How to count double-width characters, tabs, various whitespace.
>
> 6. Normalization. Almost anything that actually wants to decode Unicode
> (other than to translate it to UTF-16 for Windows filenames) wants to do
> extra analysis and will do normalization. This is hundreds of pages of
> documentation from Unicode and certainly should not be part of a
> low-level api.

-- 
Jan Arne Petersen
Openismus GmbH
http://www.openismus.com
_______________________________________________
wayland-devel mailing list
wayland-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/wayland-devel