On Sun, May 6, 2018 at 3:37 PM, Dorota Czaplejewicz <dorota.czaplejew...@puri.sm> wrote: > On Sat, 5 May 2018 13:37:44 +0200 > Silvan Jegen <s.je...@gmail.com> wrote: > >> On Sat, May 05, 2018 at 11:09:10AM +0200, Dorota Czaplejewicz wrote: >> > On Fri, 4 May 2018 22:32:15 +0200 >> > Silvan Jegen <s.je...@gmail.com> wrote: >> > >> > > On Thu, May 03, 2018 at 10:46:47PM +0200, Dorota Czaplejewicz wrote: >> > > > On Thu, 3 May 2018 21:55:40 +0200 >> > > > Silvan Jegen <s.je...@gmail.com> wrote: >> > > > >> > > > > On Thu, May 03, 2018 at 09:22:46PM +0200, Dorota Czaplejewicz wrote: >> > > > > > On Thu, 3 May 2018 20:47:27 +0200 >> > > > > > Silvan Jegen <s.je...@gmail.com> wrote: >> > > > > > >> > > > > > > Hi Dorota >> > > > > > > >> > > > > > > Some comments and typo fixes below. >> > > > > > > >> > > > > > > On Thu, May 03, 2018 at 05:41:21PM +0200, Dorota Czaplejewicz >> > > > > > > wrote: >> > > > > > > > + Text is valid UTF-8 encoded, indices and lengths are in >> > > > > > > > code points. If a >> > > > > > > > + grapheme is made up of multiple code points, an index >> > > > > > > > pointing to any of >> > > > > > > > + them should be interpreted as pointing to the first one. >> > > > > > > >> > > > > > > That way we make sure we don't put the cursor/anchor between >> > > > > > > bytes that >> > > > > > > belong to the same UTF-8 encoded Unicode code point which is >> > > > > > > nice. It >> > > > > > > also means that the client has to parse all the UTF-8 encoded >> > > > > > > strings >> > > > > > > into Unicode code points up to the desired cursor/anchor position >> > > > > > > on each "preedit_string" event. For each >> > > > > > > "delete_surrounding_text" event >> > > > > > > the client has to parse the UTF-8 sequences before and after the >> > > > > > > cursor >> > > > > > > position up to the requested Unicode code point. >> > > > > > > >> > > > > > > I feel like we are processing the UTF-8 string already in the >> > > > > > > input-method. 
So I am not sure that we should parse it again on >> > > > > > > the >> > > > > > > client side. Parsing it again would also mean that the client >> > > > > > > would need >> > > > > > > to know about UTF-8 which would be nice to avoid. >> > > > > > > >> > > > > > > Thoughts? >> > > > > > >> > > > > > The client needs to know about Unicode, but not necessarily about >> > > > > > UTF-8. Specifying code points is actually an advantage here, >> > > > > > because >> > > > > > byte offsets are inherently expressed relative to UTF-8. By >> > > > > > counting >> > > > > > with code points, client's internal representation can be UTF-16 or >> > > > > > maybe even something else. >> > > > > >> > > > > Maybe I am misunderstanding something but the protocol specifies that >> > > > > the strings are valid UTF-8 encoded and the cursor/anchor offsets >> > > > > into >> > > > > the strings are specified in Unicode points. To me that indicates >> > > > > that >> > > > > the application *has to parse* the UTF-8 string into Unicode points >> > > > > when receiving the event otherwise it doesn't know after which >> > > > > Unicode >> > > > > point to draw the cursor. Of course the application can then decide >> > > > > to >> > > > > convert the UTF-8 string into another encoding like UTF-16 for >> > > > > internal >> > > > > processing (for whatever reason) but that doesn't change the fact >> > > > > that >> > > > > it still would have to parse the incoming UTF-8 (and thus know about >> > > > > UTF-8). >> > > > > >> > > > Can you see any way to avoid parsing UTF-8 in order to draw the >> > > > cursor? I tried to come up with a way to do that, but even with >> > > > specifying byte strings, I believe that calculating the position of >> > > > the cursor - either in pixels or in glyphs - requires full parsing of >> > > > the input string. >> > > >> > > Yes, I don't think it's avoidable either. 
You just don't have to do >> > > it twice if your text rendering library consumes UTF-8 strings with >> > > byte-offsets though. See my response below. >> > > >> > > >> > > > > > There's no avoiding the parsing either. What the application cares >> > > > > > about is that the cursor falls between glyphs. The application >> > > > > > cannot >> > > > > > know that in all cases. Unicode allows the same sequence to be >> > > > > > displayed in multiple ways (fallback): >> > > > > > >> > > > > > https://unicode.org/emoji/charts/emoji-zwj-sequences.html >> > > > > > >> > > > > > One could make an argument that byte offsets should never be close >> > > > > > to ZWJ characters, but I think this decision is better left to the >> > > > > > application, which knows what exactly it is presenting to the user. >> > > > > >> > > > > The idea of the previous version of the protocol (from my >> > > > > understanding) >> > > > > was to make sure that only valid UTF-8 and valid byte-offsets (== not >> > > > > falling between bytes of a Unicode code point) into the string will >> > > > > be >> > > > > sent to the client. If you just get a byte-offset into a UTF-8 >> > > > > encoded >> > > > > string you trust the sender to honor the protocol and thus you can >> > > > > just >> > > > > pass the UTF-8 encoded string unprocessed to your text rendering >> > > > > library >> > > > > (provided that the library supports UTF-8 strings which is what I am >> > > > > assuming) without having to parse the UTF-8 string into Unicode code >> > > > > points. >> > > > > >> > > > > Of course the Unicode code points will have to be parsed at some >> > > > > point >> > > > > if you want to render them. Using byte-offsets just lets you do that >> > > > > at >> > > > > a later stage if your libraries support UTF-8. >> > > > > >> > > > > >> > > > Doesn't that chiefly depend on what kind of text rendering library >> > > > is used though? 
As far as I understand, passing text to rendering is necessary >> > > > to calculate the cursor position. At the same time, it doesn't matter >> > > > much for the calculations whether the cursor offset is in bytes or >> > > > code points - the library does the parsing in the last step anyway. >> > > > >> > > > I think you mean that if the rendering library accepts byte offsets >> > > > as the only format, the application would have to parse the UTF-8 >> > > > unnecessarily. I agree with this, but I'm not sure we should optimize >> > > > for this case. Other libraries may support only code points instead. >> > > > >> > > > Did I understand you correctly? >> > > >> > > Yes, that's what I meant. I also assumed that no text rendering library >> > > expects you to pass the string length in Unicode points. I had a look >> > > and the ones I managed to find expected their lengths in bytes: >> > > >> > > * Pango: >> > > https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text >> > > * Harfbuzz: https://harfbuzz.github.io/hello-harfbuzz.html >> > >> > I looked a bit deeper and found hb_buffer_add_utf8: >> > >> > https://cgit.freedesktop.org/harfbuzz/tree/src/hb-buffer.cc#n1576 >> > >> > It seems to require both (either?) the number of bytes (for buffer >> > size) and the number of code points in the same call. In this case, it >> > doesn't matter how the position information is expressed. >> >> Haha, as an API I think that's horrible... >> >> >> > > For those you would need to parse the UTF-8 string yourself first in >> > > order to find out at which byte position the Unicode point stops where >> > > the protocol wants you to draw the cursor (if the protocol sends Unicode >> > > point offsets). >> > > >> > > I feel like it would make sense to optimize for the more common case. I >> > > assume that is the one where you need to pass a length in bytes to the >> > > text rendering library, not in Unicode points. 
>> > > >> > > Admittedly, I haven't used a lot of text rendering libraries so I would >> > > very much like to hear more opinions on the issue. >> > > >> > >> > Even if some libraries expect to work with bytes, I see three >> > reasons not to provide them. Most importantly, I believe that we >> > should avoid letting people shoot themselves in the foot whenever >> > possible. Specifying bytes leaves a lot of wiggle room to communicate >> > invalid state. The supporting reason is that protocols shouldn't be >> > tied to implementation details. >> >> I agree that this is an advantage of using offsets measured in Unicode >> code points. >> >> Still, it worries me to think about how for the next 10-20 years people >> using these protocols have to parse their UTF-8 strings into Unicode >> points twice for no good reason... >> >> >> > The least important reason is that handling Unicode is getting better >> > than it used to be. Taking Python as an example: >> > >> >> That's true to some extent (personally I like Go's string and Unicode >> handling) >> but Python is a bad example IMO. Python 3 handles strings this way while >> Python 2 handles them in a completely different way: >> >> Python 2.7.15 (default, May 1 2018, 20:16:04) >> [GCC 7.3.1 20180406] on linux2 >> Type "help", "copyright", "credits" or "license" for more information. >> >>> 'æþ' >> '\xc3\xa6\xc3\xbe' >> >>> 'æþ'[1] >> '\xa6' >> >> and I am not sure either of them is easy and efficient to work with. >> >> >> > >>> 'æþ'[1] >> > 'þ' >> > >>> len('æþ'.encode('utf-8')) >> > 4 >> > >> > Strings are natively indexed with code points. This matches at least >> > my intuition when I'm asked to place a cursor somewhere inside a >> > string and tell the index. >> >> Go expects all strings to be UTF-8 encoded and they are indexed by >> byte. 
You can iterate over strings to get unicode points (called 'rune's >> there) should you need them: >> >> for offset, r := range "æþ" { >> fmt.Printf("start byte pos: %d, code point: %c\n", offset, r) >> } >> >> start byte pos: 0, code point: æ >> start byte pos: 2, code point: þ >> >> Using Go's approach you can treat strings as UTF-8 bytes if that's all >> you want to care about while still having an easy way to parse them into >> Unicode points if you need them. >> >> >> > In the end, I'm not an expert in that area either - perhaps treating >> > client side strings as UTF-8 buffers makes sense, but at the moment >> > I'm still leaning towards the code point abstraction. >> >> Someone (™) should probably implement a client making use of the protocol >> to see what the real world impact of this protocol change would be. >> >> The editor in the weston project uses pango for its text layout: >> >> https://cgit.freedesktop.org/wayland/weston/tree/clients/editor.c#n824 >> >> so it would have to parse the UTF-8 string twice. The same is most likely >> true for all programs using GTK... >> >> > > I made an attempt to dig deeper, and while I stopped short of becoming this > Someone for now, I gathered what I think are some important results. > > First, the state of the libraries. There's a lot of data I gathered, so I'll > keep this section rather dense. First, another contender for the title of > text layout library, and that one uses code points exclusively: > > https://github.com/silnrsi/graphite/blob/master/include/graphite2/Segment.h > `gr_make_seg` > > https://github.com/silnrsi/graphite/blob/master/tests/examples/simple.c > > Afterwards, I focused on GTK and Qt. As an input method plugin developer, I > looked at the IM interfaces and internal data structures they expose. The > results were not that clear - no mention of "code points", some references to > "bytes", many to "characters" (not "chars"). 
What is certain is that there's > a lot of converting going on behind the scenes anyway. First off, GTK seems > to be moving away from bytes, judging by the comments: > > gtk 3.22 (`gtkimcontext.c`) > > `gtk_im_context_delete_surrounding` > >> * Asks the widget that the input context is attached to to delete >> * characters around the cursor position by emitting the >> * GtkIMContext::delete_surrounding signal. Note that @offset and @n_chars >> * are in characters not in bytes which differs from the usage other >> * places in #GtkIMContext. > > `gtk_im_context_get_preedit_string` > >> * @cursor_pos: (out): location to store position of cursor (in characters) >> * within the preedit string. > > `gtk_im_context_get_surrounding` > >> * @cursor_index: (out): location to store byte index of the insertion >> * cursor within @text. > > gtkEntry seems to store things internally as characters. > > While GTK using code points internally is not a proof of anything, it's a > suggestion that there is a reason not to use bytes. > > Then, Qt, from https://doc.qt.io/qt-5/qinputmethodevent.html#setCommitString > >> replaceLength specifies the number of characters to be replaced > > a confirmation that "characters" means "code points" comes from > https://doc.qt.io/qt-5/qlineedit.html#cursorPosition-prop . The value > reported when "æþ|" is displayed is 2. > > I also spent more time than I should writing a demo implementation of an > input method and a client connecting to it to check out the proposed > interfaces. Predictably, it gave me a lot of trouble on the edges between > bytes and code points, but I blame it on Rust's scarcity of UTF handling > functions. The hack is available at > https://code.puri.sm/dorota.czaplejewicz/impoc > > My impression at the moment is that it doesn't matter much how offsets within > UTF strings are encoded, but that code points slightly better reflect what's > going on in the GUI toolkits, apart from the benefits mentioned in my other > emails. 
There seems to be so much going on behind the scenes and the parsing > is so cheap that it doesn't make sense to worry about the computational > aspect, just try to make things easier to get right. > > Unless someone chimes in with more arguments, I'm going to keep using code > points in following revisions.
I don't mean to do a drive-by or bikeshed; I do actually have a vested interest in this protocol (I've implemented the previous IM protocols on Webkit For Wayland). I've really been meaning to try it out, but haven't yet had time. I also have quite a bit of experience with Unicode (and specifically UTF-8) due to my day job, so I wanted to chime in...

IMHO, if you are doing UTF-8 (which you should), you should *always* specify any offset in the string as a byte offset. I have a few reasons for this position:

1. Unicode is *hard*, and it has a lot of terms that people aren't always familiar with (code points, glyphs, encodings, and the worst overloaded term, "characters"). "A byte offset in UTF-8" should be universally and unambiguously understood.

2. Even if you specified the cursor offset as an index into a UTF-32 array of code points, you *still* could end up with the cursor "in between" a printed glyph due to combining diacritical marks.

3. Due to UTF-8's self-synchronizing encoding, it is actually very easy to determine whether a given byte is the start of a code point or in the middle of one (and even determine *which* byte in the sequence it is). Consequently, if you do find the offset in the middle of a code point, it is pretty trivial to either move to the next code point or move back to the beginning of the current one. As such, I have always found a byte offset more useful, because it can more easily be converted to a code point offset than the other way around.

4. As more of a "gut feel" sort of thing... a Wayland protocol is a pretty well-defined binary API (like a networking API...), and specifying in bytes feels more "stable"... 
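The self-synchronization property in point 3 can be sketched in a few lines of Go (the helper name here is mine, not from any proposal or toolkit): given any byte offset into a valid UTF-8 string, one mask per byte is enough to find the start of the enclosing code point, because continuation bytes always match the bit pattern 10xxxxxx.

```go
package main

import "fmt"

// codePointStart snaps an arbitrary byte offset in a valid UTF-8 string
// back to the first byte of the code point it falls inside. Continuation
// bytes always match the bit pattern 10xxxxxx, so a single mask per byte
// suffices; this is what "self-synchronizing" buys you.
func codePointStart(s string, i int) int {
	for i > 0 && s[i]&0xC0 == 0x80 {
		i--
	}
	return i
}

func main() {
	s := "æþ" // encoded as the four bytes C3 A6 C3 BE
	for i := 0; i < len(s); i++ {
		fmt.Printf("byte %d belongs to the code point starting at byte %d\n",
			i, codePointStart(s, i))
	}
}
```

Going the other way, from a code point index to a byte position, requires walking the string from the start, which is exactly the asymmetry point 3 describes.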
Sorry, I really don't have solid data to back that up, but I would need a lot of convincing that code points were better if someone was proposing throwing this data in a UDP packet and blasting it across a network :)

Thanks,
Joshua Watt

> Cheers,
> Dorota

_______________________________________________
wayland-devel mailing list
wayland-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/wayland-devel
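For reference, the conversion this thread keeps coming back to - a protocol that speaks code points feeding a library that wants byte lengths, as with Pango's pango_layout_set_text - is a single pass over the string. A sketch in Go (matching the snippet quoted earlier in the thread; the helper name is illustrative, not from any toolkit):

```go
package main

import "fmt"

// byteOffset converts a cursor position counted in code points, as the
// proposed protocol specifies, into the byte offset that byte-oriented
// APIs expect. Ranging over a Go string yields the byte index at which
// each code point (rune) starts.
func byteOffset(s string, codePoints int) int {
	n := 0
	for i := range s {
		if n == codePoints {
			return i
		}
		n++
	}
	return len(s) // cursor past the last code point
}

func main() {
	// "æþ" is four bytes (C3 A6 C3 BE); a cursor after the first code
	// point sits at byte offset 2.
	fmt.Println(byteOffset("æþ", 1)) // prints 2
}
```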