> -----Original Message-----
> From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
> String data, generally speaking, has the following characteristics:
> a series of code points
> A character set (ASCII, EBCDIC, Unicode, whatever)
> An encoding (UTF-8, UTF-16, 32-bit integers)
> A language
> Length in bytes
> Length in code points
> Length in Glyphs

Dan, I'm confused about what you mean by Code Points. What is the datatype
of a Code Point? If it's a 32-bit integer, then the Encoding item in your
list looks irrelevant, since by definition a series of code points would
already be a series of 32-bit integers. Otherwise, maybe what you mean by
"a series of code points" is really "a series of bytes (valid in the given
Encoding)".
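
To make the question concrete, here is a small Python sketch (not from Dan's
message; the sample string is an arbitrary choice). It shows one series of
code points stored as three different series of bytes, which is the only
place an Encoding seems to matter:

```python
# One string, viewed first as code points and then as bytes.
text = "na\u00efve"  # "naive" with a diaeresis on the i: 5 code points

# The code points are plain integers, independent of any encoding.
code_points = [ord(ch) for ch in text]
print(code_points)  # [110, 97, 239, 118, 101]

# The same code points serialized under three encodings.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")
utf32 = text.encode("utf-32-be")

# Length in bytes differs per encoding; length in code points does not.
print(len(text), len(utf8), len(utf16), len(utf32))  # 5 6 10 20
```

Decoding any of the three byte strings with its own encoding gives back the
same five integers.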

If I understand correctly, here are some definitions:

- A Code Point is an uninterpreted 32-bit integer.
- An Encoding is a mapping of byte sequences to Code Points. 32-bit
  integers are the canonical Encoding for Code Points.
- The Character Set defines the subset of the Code Points which are valid
  and gives them some semantics.
- The Language helps further in interpreting the semantics of each Code
  Point.
- A Glyph is a graphic representation of a Code Point in a given Character
  Set and Language and Font.
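
The first three definitions above could be sketched in Python roughly as
follows. The names `decode_utf32_be`, `decode_utf8`, `ASCII`, and `valid_in`
are made up for illustration, and modelling a Character Set as a bare range
of integers (with no semantics attached) is a simplifying assumption:

```python
def decode_utf32_be(data):
    """The 'canonical' encoding: every 4 bytes is one big-endian code point."""
    if len(data) % 4 != 0:
        raise ValueError("length must be a multiple of 4 bytes")
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


def decode_utf8(data):
    """A variable-width encoding: let Python map the bytes to code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


# Assumption: a character set is modelled only as its set of valid code points.
ASCII = range(0, 128)


def valid_in(charset, code_points):
    """True if every code point belongs to the given character set."""
    return all(cp in charset for cp in code_points)


# Two different byte sequences that map to the same code points.
from_utf8 = decode_utf8(b"A\xc3\xa9")
from_utf32 = decode_utf32_be(b"\x00\x00\x00A\x00\x00\x00\xe9")
print(from_utf8, from_utf32)       # [65, 233] [65, 233]
print(valid_in(ASCII, from_utf8))  # False: 233 is outside ASCII
```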

It seems to me that you can't interpret a glyph without a font. For example,
a font may not have a glyph for "ae" or "fi", so in that font those
characters would be represented as two glyphs each.
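
What can be checked without a font is the code point side of this. Unicode
happens to assign a code point of its own (U+FB01) to the "fi" ligature, and
Python's standard `unicodedata` module can expand it into two code points.
This is only a sketch of the one-to-two relationship; which glyphs actually
get drawn is still up to the font, so nothing below measures length in
glyphs:

```python
import unicodedata

ligature = "\ufb01"  # a single code point
print(unicodedata.name(ligature))  # LATIN SMALL LIGATURE FI

# Compatibility decomposition splits the ligature into "f" + "i".
expanded = unicodedata.normalize("NFKD", ligature)
print(len(ligature), len(expanded))  # 1 2

# The reverse case: "e" plus a combining acute accent is two code points
# that a font would normally draw as one glyph.
accented = "e\u0301"
print(len(accented))  # 2
```
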
-- John Wiersba