Thanks, nice article. We got into some of those hairy caret positioning
issues back at Apple; we even had a design that would associate a series of
lines (which could be slanted and positioned) with a ligature, but
ultimately 1/m gets you 99% of the value, with very little cost.
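
A minimal sketch of the "1/m" idea, assuming it means splitting a ligature's advance width into m equal parts, one per underlying character, and placing a caret stop at each internal division. The function name and the font-unit widths are hypothetical and only illustrate the arithmetic; this is not the design Apple shipped.

```python
def ligature_caret_offsets(advance_width: float, m: int) -> list[float]:
    """Return the internal caret offsets for a ligature of m characters.

    Assumption: carets sit at equal fractions k/m of the advance width,
    for k = 1 .. m-1. The outer edges (0 and advance_width) are the normal
    caret stops before and after the glyph, so they are not included.
    """
    if m < 1:
        raise ValueError("a ligature must represent at least one character")
    return [advance_width * k / m for k in range(1, m)]


# An "ffi" ligature (m = 3) that is 900 font units wide gets two internal stops.
print(ligature_caret_offsets(900.0, 3))
```

The slanted, per-ligature caret lines mentioned above would replace this uniform split with hand-tuned positions, which is the extra cost the simple division avoids.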

(My article was just targeted at the very lowest level of Unicode
representation, without getting into the further complications for higher
level constructs like grapheme clusters, ligatures, etc.)

------------------------------
Mark <https://plus.google.com/114199149796022210033>
*— Il meglio è l’inimico del bene —*



On Fri, Jul 20, 2012 at 4:16 PM, Murray Sargent <
murr...@exchange.microsoft.com> wrote:

> Mark wrote: “I put together some notes on different ways for programming
> languages to handle Unicode at a low level. Comments welcome.”
>
> Nice article as far as it goes and additions are forthcoming. In addition
> to multiple code units per character in UTF-8 and UTF-16, there are
> variation selectors, combining marks, ligatures, and clusters, all of which
> imply handling variable-length sequences even for UTF-32. Handling the
> variable-length encoding of code points in UTF-8 and UTF-16 is actually
> considerably easier than dealing with these other sources of variable
> length. For all cases, you need to be able to find "character entity"
> boundaries for an arbitrary code-unit index.
>
> My latest blog post “Ligatures, Clusters, Combining Marks and Variation
> Sequences” <http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx>
> discusses some of these complications.
>
> One amusing thing is that where I work it’s common to use cp to mean
> “character position”, which more precisely is “UTF-16 code-unit index”,
> whereas in Mark’s post, cp is used for codepoint.
>
> Murray

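To make the two senses of "cp" in the quoted message concrete, here is a rough sketch of the easy case Murray describes: taking an arbitrary UTF-16 code-unit index, snapping it to a code point boundary, and converting it to a code point index. All function names are hypothetical. It handles surrogate pairs only; the harder boundaries (combining marks, variation sequences, clusters, ligatures) need Unicode property data and are not covered.

```python
def utf16_units(text: str) -> list[int]:
    """Encode a Python string as a list of UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def snap_to_code_point_start(units: list[int], index: int) -> int:
    """Move a code-unit index back by one if it falls inside a surrogate pair."""
    if not 0 <= index <= len(units):
        raise IndexError("code-unit index out of range")
    if (0 < index < len(units)
            and is_low_surrogate(units[index])
            and is_high_surrogate(units[index - 1])):
        return index - 1
    return index


def code_unit_index_to_code_point_index(units: list[int], index: int) -> int:
    """Convert a UTF-16 code-unit index into a code point index."""
    index = snap_to_code_point_start(units, index)
    count = 0
    i = 0
    while i < index:
        # A well-formed surrogate pair is two code units but one code point.
        if (is_high_surrogate(units[i])
                and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            i += 2
        else:
            i += 1
        count += 1
    return count


units = utf16_units("a\U0001F600b")  # 'a', one emoji (a surrogate pair), 'b'
print(snap_to_code_point_start(units, 2))             # index 2 is mid-pair
print(code_unit_index_to_code_point_index(units, 3))  # code-unit index of 'b'
```

Snapping only needs to look at one neighbouring code unit, which is why this case is so much simpler than finding cluster boundaries.
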