Re: Grapheme clusters and east asian width

Daniel Bünzli Wed, 16 Sep 2015 14:39:10 -0700

On Wednesday, 16 September 2015 at 21:27, Dominikus Dittes Scherkl wrote:
> Why adding them up?
> I think every grapheme cluster of hangul syllables would have simply
> width 2 - that is the concept of CJK charakters.


I don't personally know how CJK characters behave in general w.r.t. width, 
that's why I'm asking. I'm just trying to find a simple, best-effort, 
data-driven algorithm for the problem at hand, using standard properties and, 
if possible, without building in assumptions about scripts.
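
As a rough illustration of what such a data-driven approach could look like, 
here is a minimal Python sketch. The function names and the width rules (zero 
columns for marks and format characters, two for East Asian Width W or F, one 
otherwise) are assumptions made for this example, not an established 
algorithm. It sums per scalar value and does not segment grapheme clusters, so 
a cluster made of several spacing scalar values is counted by adding them up.

```python
import unicodedata


def scalar_width(ch: str) -> int:
    """Best-effort column width of a single Unicode scalar value."""
    # Assumption: nonspacing marks, enclosing marks and format characters
    # occupy no column of their own.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # Assumption: East Asian Width W (wide) and F (fullwidth) take two
    # columns; every other value (N, Na, H, A) takes one.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def text_width(s: str) -> int:
    """Sum the per-scalar widths; no grapheme cluster segmentation is done."""
    return sum(scalar_width(ch) for ch in s)
```

For instance, text_width("e\u0301") counts the base letter and ignores the 
combining accent, while a precomposed Hangul syllable is counted as wide.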


On Wednesday, 16 September 2015 at 20:33, Richard Wordingham wrote:
> Have you addressed the issue of Indic scripts? There are
> discontiguous grapheme clusters composed of indecomposable code points
> (e.g. U+17C4 KHMER VOWEL SIGN OO) and of decomposable code points (e.g.
> U+0BCA TAMIL VOWEL SIGN OO),  

Not sure I understand what you mean here.

> and whether consonant + virama + consonant is one cell or two may even depend 
> on the font (e.g.
> Devanagari).  

Well, anything related to font metrics is out of scope from the point of view 
of a tty, as I can't get that information. For example, it seems that U+1F400 
to U+1F579 have an East Asian Width of N but actually occupy two columns in 
the built-in OS X terminal; of course these scalar values are not East Asian 
text per se.
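
To check what a given Unicode database reports for that range, a small Python 
sketch along these lines can be used. The helper name eaw_histogram is made up 
for this example. The result depends on the Unicode version bundled with the 
interpreter, so it may differ from the values reported above.

```python
import unicodedata
from collections import Counter


def eaw_histogram(first: int, last: int) -> Counter:
    """Count East Asian Width values over the inclusive range first..last."""
    return Counter(
        unicodedata.east_asian_width(chr(cp)) for cp in range(first, last + 1)
    )


# Print the Unicode version of the database next to the counts, since the
# property values are tied to that version.
print(unicodedata.unidata_version, dict(eaw_histogram(0x1F400, 0x1F579)))
```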

> How are you handling ligatures between grapheme clusters,
> e.g. English <f, i>?  

Here again I'd need font information for that, I expect the tty not to make 
ligatures between f and i.


Of course the best way would be to be able to hand a string to the tty for it 
to measure. But then it already seems impossible to test whether a terminal 
is able to handle UTF-8 or not…

Maybe trying to use the East Asian Width property was not a good idea to 
start with.

Best,

Daniel
