Mark wrote: “I put together some notes on different ways for programming 
languages to handle Unicode at a low level. Comments welcome.”

Nice article as far as it goes and additions are forthcoming. In addition to 
multiple code units per character in UTF-8 and UTF-16, there are variation 
selectors, combining marks, ligatures, and clusters, all of which imply 
handling variable-length sequences even for UTF-32. Handling the variable-length 
encodings of code points in UTF-8 and UTF-16 is actually considerably easier than 
dealing with these other sources of variable length. In all cases, you need to 
be able to find "character entity" boundaries for an arbitrary code-unit index.
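
To make the boundary-finding point concrete, here is a rough sketch in Python of snapping an arbitrary UTF-16 code-unit index back to the start of its "character entity". The function names are made up for illustration, and the rule used (back up over anything whose general category is a mark, which includes the variation selectors) is a deliberate simplification. It is not the full Unicode grapheme-cluster algorithm and ignores ligatures, ZWJ sequences, Hangul jamos and script-specific clusters.

```python
import unicodedata

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF

def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF

def snap_to_code_point(units, i):
    """If i points at the trail half of a surrogate pair, back up to the lead half."""
    if 0 < i < len(units) and is_low_surrogate(units[i]) and is_high_surrogate(units[i - 1]):
        return i - 1
    return i

def code_point_at(units, i):
    """Decode the code point that starts at code-unit index i."""
    u = units[i]
    if is_high_surrogate(u) and i + 1 < len(units) and is_low_surrogate(units[i + 1]):
        return 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
    return u

def is_extender(cp):
    """Simplified assumption: any mark (Mn, Mc, Me) attaches to what precedes it.
    Variation selectors have general category Mn, so they are covered too."""
    return unicodedata.category(chr(cp)).startswith("M")

def entity_start(units, i):
    """Return the code-unit index where the entity containing index i begins."""
    # Step 1: the easy part, fix up a split surrogate pair.
    i = snap_to_code_point(units, i)
    # Step 2: the harder part, walk back over marks and variation selectors.
    while 0 < i < len(units) and is_extender(code_point_at(units, i)):
        i = snap_to_code_point(units, i - 1)
    return i
```

For example, with "e" followed by U+0301 COMBINING ACUTE ACCENT, index 1 snaps back to 0, and an index in the middle of a surrogate pair snaps back to the lead surrogate. Step 1 needs to look at most one code unit back, whereas step 2 has no fixed bound, which is one way to see why the surrogate case is the easier of the two.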

My latest blog post, “Ligatures, Clusters, Combining Marks and Variation 
Sequences” 
<http://blogs.msdn.com/b/murrays/archive/2012/06/30/ligatures-clusters-combining-marks-and-variation-sequences.aspx>, 
discusses some of these complications.

One amusing thing is that where I work it’s common to use cp to mean “character 
position”, which more precisely is “UTF-16 code-unit index”, whereas in Mark’s 
post, cp is used for code point.
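
The two meanings of cp diverge as soon as a string contains a character outside the BMP. A small Python sketch of the difference follows (the helper name is hypothetical; Python strings are indexed by code point, so the UTF-16 code-unit index has to be computed):

```python
def utf16_index(text, code_point_index):
    """Return the UTF-16 code-unit index of the code point at code_point_index.
    Code points above U+FFFF occupy two UTF-16 code units (a surrogate pair)."""
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:code_point_index])

text = "a\U0001F600b"              # 'a', U+1F600, 'b'
position_of_b = utf16_index(text, 2)  # code point index 2 is 'b'
```

Here "b" is at code point index 2 but at UTF-16 code-unit index 3, because U+1F600 takes two code units.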

Murray

