> -----Original Message-----
> From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
> String data, generally speaking, has the following characteristics:
>     a series of code points
>     A character set (ASCII, EBCDIC, Unicode, whatever)
>     An encoding (UTF-8, UTF-16, 32-bit integers)
>     A language
>     Length in bytes
>     Length in code points
>     Length in Glyphs

Dan, I'm confused about what you mean by Code Points.  What is the datatype
of a Code Point?  If it's a 32-bit integer, then I believe the Encoding is
irrelevant, given the above list (since, by definition, a series of code
points is a series of 32-bit integers).  Otherwise, maybe what you mean by
"a series of code points" is really "a series of bytes (valid in the given
Encoding)".

If I understand correctly, here are some definitions:

- A Code Point is an uninterpreted 32-bit integer.  
- An Encoding is a mapping of byte sequences to Code Points.  32-bit
integers are the canonical Encoding for Code Points.  
- The Character Set defines the subset of the Code Points which are valid
and gives them some semantics.  
- The Language helps further in interpreting the semantics of each Code
Point.  
- A Glyph is a graphic representation of a Code Point in a given Character
Set and Language and Font.
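
To make the definitions above concrete, here is a minimal sketch in Python
(not part of the original discussion; the sample string and variable names
are illustrative assumptions).  It treats the code points as plain integers
and shows that the same series of code points has a different length in
bytes under each Encoding, which is why "a series of code points" and "a
series of bytes" are not interchangeable descriptions.

```python
# Illustrative sample: 5 code points, one of them (U+00EF) outside ASCII.
text = "na\u00efve"

# The code points themselves are just integers; no encoding is involved yet.
code_points = [ord(ch) for ch in text]

# An Encoding maps between byte sequences and code points.  The same code
# points occupy a different number of bytes under each encoding.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]
byte_lengths = {name: len(text.encode(name)) for name in encodings}

# Decoding the bytes with the matching encoding recovers the same integers.
round_trips = {
    name: [ord(ch) for ch in text.encode(name).decode(name)]
    for name in encodings
}

print("code points:", code_points)
for name in encodings:
    print(name, byte_lengths[name], "bytes:", text.encode(name).hex())
```

Under the 32-bit encoding every code point takes exactly four bytes, so the
length in bytes is always four times the length in code points; under the
other two the ratio depends on the data.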

It seems to me that you can't interpret a glyph without a font.  For example,
a font may not have a single ligature glyph for "ae" or "fi", so in that font
each of those characters would be represented as two glyphs.
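
The code-point side of this can be sketched in Python with the standard
unicodedata module (an illustration added here, not something from the
original thread).  It only shows that Unicode has a single code point for
the "fi" ligature with a two-letter compatibility form; which glyphs are
actually drawn is decided by the font and rendering engine, which this
sketch does not model.

```python
import unicodedata

# U+FB01 is a single code point for the "fi" ligature.
ligature = "\ufb01"
ligature_name = unicodedata.name(ligature)

# Compatibility decomposition (NFKD) replaces it with the two letters a
# renderer could fall back to when the font has no ligature glyph.
decomposed = unicodedata.normalize("NFKD", ligature)

# U+00E6 (the "ae" letter) has no such decomposition, so NFKD leaves it
# unchanged; any two-glyph fallback there is purely a font-level decision.
ae = "\u00e6"
ae_decomposed = unicodedata.normalize("NFKD", ae)

print(ligature_name, [hex(ord(ch)) for ch in decomposed])
print(unicodedata.name(ae), ae_decomposed == ae)
```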

-- John Wiersba
