At 10:55 AM 9/21/2001 -0400, John Wiersba wrote:
> > -----Original Message-----
> > From: Dan Sugalski [mailto:[EMAIL PROTECTED]]
> > String data, generally speaking, has the following characteristics:
> >     a series of code points
> >     A character set (ASCII, EBCDIC, Unicode, whatever)
> >     An encoding (UTF-8, UTF-16, 32-bit integers)
> >     A language
> >     Length in bytes
> >     Length in code points
> >     Length in Glyphs
>
>Dan, I'm confused about what you mean by Code Points.  What is the datatype
>of a Code Point?

There isn't one.

A code point is really an abstract character. For ASCII, it's a single 
8-bit byte. For UTF-32 encoded data, it's a 32-bit word. For UTF-8 encoded 
data, it's a variable number of bytes.
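
A small sketch of that distinction, in Python rather than anything from the engine itself: the code point stays the same abstract integer, while the number of bytes depends on the encoding. The helper name `encoded_sizes` is made up for this illustration.

```python
# Illustrative only: one abstract code point, several byte-level encodings.
def encoded_sizes(ch):
    """Return how many bytes each encoding needs for a single character."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The explicit -be variants avoid counting a byte-order mark.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

# The code point (ord) is the same whichever encoding is chosen.
for ch in ("A", "\u00e9", "\u20ac", "\U0001d11e"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always takes four bytes per code point, while UTF-8 takes anywhere from one to four in these samples.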

>If I understand correctly, here are some definitions:
>
>- A Code Point is an uninterpreted 32-bit integer.

Nope. Abstract integer. ASCII and EBCDIC code points are 8 bits. (RAD-50 
code points are 5 1/3 bits, but we probably won't go there...) Unicode 
characters are 32 bits more or less.

>- An Encoding is a mapping of byte sequences to Code Points.  32-bit
>integers are the canonical Encoding for Code Points.

We don't have a canonical representation, though the engine really prefers 
either 8 or 32 bit code points.

>- The Character Set defines the subset of the Code Points which are valid
>and gives them some semantics.

Amongst other things. (Like Unicode combining characters, direction 
markers, and suchlike interesting things)

>- The Language helps further in interpreting the semantics of each Code
>Point.

Points, not point. The two-code-point sequence "ll" should be interpreted as 
a single unit if the language is Spanish.
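
A rough sketch of what language-dependent grouping could look like, assuming a hypothetical per-language digraph table (`DIGRAPHS`) and a hypothetical helper (`language_units`). Neither comes from the design under discussion. The Spanish entries follow traditional alphabetization, which treated "ch" and "ll" as single letters.

```python
# Hypothetical table: digraphs treated as one unit, keyed by language tag.
DIGRAPHS = {"es": ("ch", "ll")}

def language_units(text, lang):
    """Split text into units, merging digraphs the language treats as one."""
    digraphs = DIGRAPHS.get(lang, ())
    units = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair.lower() in digraphs:
            units.append(pair)     # two code points, one unit
            i += 2
        else:
            units.append(text[i])  # ordinary single code point
            i += 1
    return units

print(language_units("llama", "es"))
print(language_units("llama", "en"))
```

The same code point sequence yields four units when tagged as Spanish and five otherwise, which is why the language has to travel with the string.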

>- A Glyph is a graphic representation of a Code Point in a given Character
>Set and Language and Font.
>
>It seems to me that you can't interpret glyph without a font.  For example,
>a font may not have a glyph for "ae" or "fi", so in that font those
>characters would be represented as two glyphs.

I was thinking specifically of Unicode, where you have combining 
characters. Glyph length is a bit specific to Unicode (though I've 
considered doing the same for ASCII with embedded VT100 escape sequences. 
Luckily good sense prevailed :) so it probably ought to go.
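
To make the three lengths concrete, here is a minimal sketch. The helper `lengths` is hypothetical, and it approximates "glyphs" by counting code points that are not combining marks. That is a simplification, not full grapheme cluster segmentation, and it ignores fonts entirely.

```python
import unicodedata

def lengths(text, encoding="utf-8"):
    """Return (bytes, code points, approximate glyphs) for a string."""
    n_bytes = len(text.encode(encoding))
    n_code_points = len(text)
    # Approximation: a combining mark attaches to the preceding base
    # character, so only non-combining code points start a new glyph.
    n_glyphs = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return n_bytes, n_code_points, n_glyphs

# "e" followed by U+0301 COMBINING ACUTE ACCENT displays as one glyph.
print(lengths("e\u0301"))
print(lengths("re\u0301sume\u0301"))
```

For plain ASCII all three numbers coincide, which fits the remark that glyph length is mostly a Unicode concern.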

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk
