Hello fellow hackers,

I'm very glad to announce libgrapheme[0], a library for handling grapheme clusters. In short, a grapheme cluster is what Unicode considers to be a single printed character. I have given a talk about the topic and this library at slcon 2019[1], but you can also refer to [2] and [3] for further reading.

As an example, consider the family emoji "👨‍👩‍👦". It is a single grapheme cluster and should be printed as a single character in conforming applications, but it is actually composed of the Unicode code-points man ("👨"), woman ("👩") and boy ("👦") with zero-width joiners (U+200D) in between. Each code-point is encoded in UTF-8 and thus consists of one or more bytes, so to determine how long a grapheme cluster is, one has to decode the UTF-8 and apply a set of rules given by Unicode. That is exactly what libgrapheme does, except that it hides the middle layer of code-points and only gives answers in byte-offsets.

The above emoji example might seem irrelevant (I myself dislike emojis), but the same concept is used in many other places, including certain representations of umlauts (for instance, a base letter followed by the combining diaeresis U+0308). For this reason, it is absolutely necessary to be able to handle grapheme clusters to work with textual input consistently. Current solutions like ICU are very bloated, introduce dynamic loading and are very hard to use.

libgrapheme currently only includes the function grapheme_len(const char *), which determines the length (in bytes) of the grapheme cluster beginning at the given char-pointer. libgrapheme offers the following:

 * follows grapheme cluster rules according to the latest Unicode standard version 13.0
 * automatically downloads/generates lookup-tables from unicode.org
 * automatically downloads/generates/runs conformance-tests from unicode.org
 * fully static and merely 20kB compiled

This is not a release and just an initial public commit; however, the code is very stable. Feedback is greatly appreciated, especially input on the API itself!

With best regards

Laslo Hunhold

[0]:https://git.suckless.org/libgrapheme/
[1]:https://dl.suckless.org/slcon/2019/slcon-2019-05-laslo_hunhold-reflections_on_unicode.webm
[2]:https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
[3]:https://unicode.org/reports/tr29/

PS: Here is a small example to get you started (compile with $(CC) -o example example.c -lgrapheme). As you can see, it is possible for a single "visible" character to be many bytes. It couldn't be simpler to work with it. Try doing the same with ICU and you'll see what I mean.

-----------------------------------------------------------------------
#include <grapheme.h>
#include <stdio.h>

int
main(void)
{
	const char *s = "Tëst 👨‍👩‍👦 🇺🇸 नी நி!";
	size_t len;

	/* print each grapheme cluster with accompanying byte-length */
	for (; *s != '\0'; s += len) {
		len = grapheme_len(s);
		/* %.*s takes the precision as an int, then the string */
		printf("%2zu bytes | %.*s\n", len, (int)len, s);
	}

	return 0;
}
-----------------------------------------------------------------------

OUTPUT:

 1 bytes | T
 2 bytes | ë
 1 bytes | s
 1 bytes | t
 1 bytes |  
18 bytes | 👨‍👩‍👦
 1 bytes |  
 8 bytes | 🇺🇸
 1 bytes |  
 6 bytes | नी
 1 bytes |  
 6 bytes | நி
 1 bytes | !
-----------------------------------------------------------------------