Hi Laslo, On Thu, 7 May 2020 15:15:24 +0200 Laslo Hunhold <d...@frign.de> wrote:
> On Sat, 25 Apr 2020 23:03:56 +0200 > Mattias Andrée <maand...@kth.se> wrote: > > Dear Mattias, > > > cp_decode has been renamed to grapheme_decode > > I personally don't like that change. It's already difficult enough to > understand the differences between codepoints and grapheme clusters, > and changing the name of the function to that really complicates it > more and might cause misunderstandings. Perhaps, but do you I wouldn't expect anyone that don't understand the difference to even use libgrapheme. But you would also state in grapheme.h and the man pages that all functions except grapheme_len are low-level functions. > > This also contradicts the below goal of providing a solution for > non-UTF-8-text, which I honestly don't care about. There's no logical > reason to encode text into anything other than UTF-8 and the best > approach is to just re-encode everything into UTF-8. Not a goal, but a positive side-effect of exposing the boundary test function. > > It's a tough call to decide if we want to turn libgrapheme into a > general purpose UTF-8-library. I don't hiding the code function, but since the library already used it, I thought it was nice to not have to implement your own when using the library. > > > and boundary has been renamed to grapheme_boundary. > > I'm a bit conflicted with this change, though I would probably expose > it as grapheme_boundary(uint32_t, uint32_t, int *) if I chose to > include it. Sure, we do use Codepoint interally as a typedef, but for a > "public" API I prefer not to have them if possible. I would as well, I just didn't want to change too much. > > The reason I'm conflicted with this change is that there's no guarantee > the grapheme-cluster-boundary algorithm gets changed again. It already > has been in Unicode 12.0, which made it suddenly require a state to be > carried with it, but there's no guarantee it will get even crazier, > making it almost infeasible to expose more than a "gclen()"-function to > the user. How about typedef struct grapheme_state GRAPHEME_STATE; /* Hidden from the user { */ struct grapheme_state { uint32_t cp0; int state; }; /* } */ int grapheme_boundary(uint32_t cp1, GRAPHEME_STATE *); GRAPHEME_STATE *grapheme_create_state(void); /* Just in case the state in the future * would require dynamic allocation */ void grapheme_free_state(GRAPHEME_STATE *); grapheme_create_state() would reset the state each time a boundary is found, so no reset function would be needed, and would be useful to avoid a new allocation if the grapheme cluster identification process is aborted and a a started for a new text. Since this would be very rare there, no reset function is needed. The only future I can see there this wouldn't be sufficient if a cluster break (or non-break) could be retroactively inserted where where the algorithm already stated that there as no break (or was a break). This would be so bizarre, I cannot imagine this would ever be the case. > > The current implementation can store 32 states and uses 2 of them for > the algorithm. In this regard, we still have some headroom. > > > The purpose of this is to allow faster text rendering > > where both individual code points and grapheme clusters > > boundaries are of interest, but it also (1) makes it > > easy to do online processing of large document (the user > > does not need to search for spaces, but only know an > > upper limit for how long encoding is needed to encode > > any codepoint) and (2) makes to library easy to use > > with non-UTF-8 text. > > As I said above, I don't care about non-UTF-8-text and anything > non-UTF-8 is either an internal representation (e.g. in Java with > UTF-16LE) or ancient. > > I see how the stateful function might be useful though for a > byte-per-byte reading of a file, or something else. > > > This change also eliminates all unnamespaced, non-static > > functions that are not exposed to the user. > > This is a very good point! I'll try to solve that in an own coding > session on the existing code. Thanks for your patch! I will have to > think about this more. What do you think about the following > API-overview integrating above changes? This would also include some > UTF-8 functionality. > > size_t grapheme_cp_decode(uint32_t*, char *, size_t) > Decode the UTF-8 sequence into a codepoint from the array of given > length, returning the number of bytes consumed or zero if there was > an error (which also sets the codepoint to UTF_INVALID) > > size_t grapheme_cp_encode(uint32_t, char *, size_t) > Encode the given codepoint into UTF-8 in the given array of given > length. Return the number of bytes "used" or zero if something failed > (codepoint out of bounds, array too small, etc.) > > size_t grapheme_len(const char *) > Return the length (in bytes) of the grapheme cluster beginning at > the provided char-address. > > int grapheme_boundary(uint32_t, uint32_t, int *state) > Based on the current state (which is 0 at the beginning) determine > if between two codepoints is a grapheme-cluster-boundary. If so, > return 1, and 0 otherwise. > > What do you think? I don't see the point of including grapheme_cp_encode(), however I'm not opposed to making a larger UTF-8/Unicode library, rather I think it would be nice to have one place for all my Unicode needs, especially if I otherwise would have a hand full of libraries that all their own UTF-8 decoding functions that all have to be linked. > > With best regards > > Laslo >