At 10:55 AM 9/21/2001 -0400, John Wiersba wrote:
> > -----Original Message-----
> > From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
> > String data, generally speaking, has the following characteristics:
> > A series of code points
> > A character set (ASCII, EBCDIC, Unicode, whatever)
> > An encoding (UTF-8, UTF-16, 32-bit integers)
> > A language
> > Length in bytes
> > Length in code points
> > Length in glyphs
>
>Dan, I'm confused about what you mean by Code Points. What is the datatype
>of a Code Point?
There isn't one.
A code point is really an abstract character, and how much space it takes
depends on how it's stored. For ASCII, it's a single 8-bit byte. For UTF-32
encoded data, it's a 32-bit word. For UTF-8 encoded data, it's a variable
number of bytes.
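
To make the point concrete, here is a rough Python sketch (not anything from
the engine; the helper name encoded_sizes is made up for illustration) that
shows one abstract code point taking a different number of bytes depending
on the encoding:

```python
def encoded_sizes(ch):
    """Return the code point of one character and its byte count per encoding."""
    sizes = {"code_point": ord(ch)}
    # The "-le" variants are used so that no byte-order mark is counted.
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        sizes[enc] = len(ch.encode(enc))
    return sizes


# Same kind of thing (one code point each), very different storage sizes.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The code point itself is just the integer; the byte counts belong to the
encoding, not to the code point.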
>If I understand correctly, here are some definitions:
>
>- A Code Point is an uninterpreted 32-bit integer.
Nope. Abstract integer. ASCII and EBCDIC code points fit in 8 bits (ASCII
strictly needs only 7). (RAD-50 code points are 5 1/3 bits, but we probably
won't go there...) Unicode characters are 32 bits more or less.
>- An Encoding is a mapping of byte sequences to Code Points. 32-bit
>integers are the canonical Encoding for Code Points.
We don't have a canonical representation, though the engine really prefers
either 8 or 32 bit code points.
>- The Character Set defines the subset of the Code Points which are valid
>and gives them some semantics.
Amongst other things. (Like Unicode combining characters, direction
markers, and suchlike interesting things)
>- The Language helps further in interpreting the semantics of each Code
>Point.
Points, not point. The two-code-point sequence "ll" should be interpreted
as a single unit if the language is Spanish.
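
As a rough illustration of language-dependent units, here is a minimal
Python sketch. The DIGRAPHS table and the units() helper are hypothetical,
and the table is deliberately tiny; real language rules are far richer than
this:

```python
# Hypothetical table of multi-code-point units per language (an assumption
# for illustration only; traditional Spanish treats "ch" and "ll" as units).
DIGRAPHS = {"es": ("ch", "ll")}


def units(text, lang):
    """Split text into units, keeping language-specific digraphs together."""
    digraphs = DIGRAPHS.get(lang, ())
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in digraphs:
            out.append(pair)      # two code points, one unit
            i += 2
        else:
            out.append(text[i])   # one code point, one unit
            i += 1
    return out


print(units("llama", "es"))  # ['ll', 'a', 'm', 'a']
print(units("llama", "en"))  # ['l', 'l', 'a', 'm', 'a']
```

The same code points give a different answer once the language changes,
which is why the language has to travel with the string.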
>- A Glyph is a graphic representation of a Code Point in a given Character
>Set and Language and Font.
>
>It seems to me that you can't interpret glyph without a font. For example,
>a font may not have a glyph for "ae" or "fi", so in that font those
>characters would be represented as two glyphs.
I was thinking specifically of Unicode, where you have combining
characters. Length in glyphs is a bit specific to Unicode (though I've
considered doing the same for ASCII with embedded VT100 escape sequences.
Luckily good sense prevailed :), so it probably ought to go.
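
For what it's worth, the three lengths can be sketched in Python like this.
The lengths() helper is made up for illustration, and counting
non-combining code points is only an approximation of "glyphs" (proper
segmentation follows the Unicode grapheme cluster rules, and, as noted
above, real glyphs depend on the font):

```python
import unicodedata


def lengths(text, encoding="utf-8"):
    """Return length in bytes, in code points, and an approximate glyph count."""
    # Approximation: a code point with a nonzero combining class attaches to
    # the preceding base character instead of starting a new glyph.
    glyphs = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return {
        "bytes": len(text.encode(encoding)),
        "code_points": len(text),
        "glyphs": glyphs,
    }


# "cafe" followed by U+0301 COMBINING ACUTE ACCENT
print(lengths("cafe\u0301"))
# {'bytes': 6, 'code_points': 5, 'glyphs': 4}
```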
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk