John Wiersba <[EMAIL PROTECTED]>
> Dan, I'm confused about what you mean by Code Points.  What is the datatype
> of a Code Point?  If it's a 32-bit integer, then I believe the Encoding is
> irrelevant, given the above list (since, by definition, a series of code
> points is a series of 32-bit integers).  Otherwise, maybe what you mean
> instead of "a series of code points" is "a series of bytes (valid in the
> given Encoding)"

There are two *really* useful numbers: the amount of memory allocated
to store the data, and the number of characters (lots of apps like that
value for some reason).
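
As a rough illustration of why those two numbers differ, here is a
minimal Python sketch. The sample string is made up, and the encoded
byte length is only a stand-in for "memory allocated" (a real
implementation may allocate more than that). Note also that len()
counts code points, which is not always what a user would call a
character.

```python
# Two different sizes for the same text: characters vs. bytes of storage.
text = "na\u00efve \u65e5\u672c\u8a9e"  # "naive" with a diaeresis, a space, three kanji

char_count = len(text)                       # number of code points
utf8_bytes = len(text.encode("utf-8"))       # storage size if held as UTF-8
utf16_bytes = len(text.encode("utf-16-le"))  # storage size as UTF-16, no BOM

print(char_count, utf8_bytes, utf16_bytes)
```

The character count stays the same whichever encoding is chosen; only
the byte counts change.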

> If I understand correctly, here are some definitions:
> 
> - A Code Point is an uninterpreted 32-bit integer.  

No.  Code points are interpreted values, which are grouped together
into code sets (more properly coded character sets.)  ASCII is
properly a code set.
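
To make "interpreted values" concrete, here is a small Python sketch
using the standard unicodedata module. It treats Unicode as the coded
character set; the in_ascii() helper is a hypothetical name of mine,
there only to show ASCII as a much smaller code set.

```python
import unicodedata

# A code point is a number *with a meaning* assigned by a coded character set.
names = {cp: unicodedata.name(chr(cp)) for cp in (0x41, 0xE9, 0x3042)}
for cp, name in names.items():
    print(f"U+{cp:04X} {name}")


def in_ascii(cp: int) -> bool:
    """Hypothetical helper: is this code point part of the ASCII code set?"""
    return 0 <= cp <= 0x7F


print([in_ascii(cp) for cp in names])
```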

> - An Encoding is a mapping of byte sequences to Code Points.

Yes (technically it is a bidirectional mapping...)
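
A quick sketch of the mapping running in both directions, using
Python's built-in codecs. The sample string is arbitrary; the point is
that one code point can map to different byte sequences under
different encodings, and each encoding maps its bytes back again.

```python
text = "caf\u00e9"  # ends in U+00E9, e with acute accent

latin1 = text.encode("latin-1")  # code points -> bytes
utf8 = text.encode("utf-8")      # same code points, different bytes

print(latin1, utf8)

# bytes -> code points: each encoding inverts its own mapping
round_trip_ok = (latin1.decode("latin-1") == text
                 and utf8.decode("utf-8") == text)
print(round_trip_ok)
```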

Note that encodings can be either nonmodal or modal.  An example of a
modal encoding is JIS.  An example of a nonmodal encoding is Shift-JIS.
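
The difference shows up if you look at the bytes. The sketch below
assumes that Python's "iso2022_jp" codec is a fair stand-in for what I
am calling JIS here; "shift_jis" is the nonmodal one. The sample text
is made up.

```python
text = "A\u65e5B"  # Latin A, one kanji, Latin B

modal = text.encode("iso2022_jp")    # switches character set with ESC sequences
nonmodal = text.encode("shift_jis")  # each character is self-describing

print(modal)
print(nonmodal)

ESC = b"\x1b"
has_shift_in = ESC + b"$B" in modal    # switch into the kanji set
has_shift_out = ESC + b"(B" in modal   # switch back to ASCII
print(has_shift_in, has_shift_out, ESC in nonmodal)
```

In the modal form, the meaning of a byte depends on which escape
sequence came before it, so you cannot start decoding from the middle
of the data without knowing the current mode.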

> 32-bit integers are the canonical Encoding for Code Points.

Not necessarily, but I doubt that human languages have used 4G (2^32)
symbols in their writing systems as yet...  :^)

> - The Character Set defines the subset of the Code Points which are valid
> and gives them some semantics.  

That's the code set (there are also uncoded character sets, but
they're not much use with computers.)

> - The Language helps further in interpreting the semantics of each Code
> Point.
> - A Glyph is a graphic representation of a Code Point in a given Character
> Set and Language and Font.

It also depends on the surrounding code points, because of ligatures.
:^(

> It seems to me that you can't interpret glyph without a font.  For
> example, a font may not have a glyph for "ae" or "fi", so in that
> font those characters would be represented as two glyphs.

Yes, and some scripts (e.g. Arabic) use a lot of ligatures.

If anyone is feeling confused by all this, I recommend chapters 2, 6,
7 and 8 of _Java Internationalization_ by Andrew Deitsch and David
Czarnecki (O'Reilly) which goes into *lots* of detail.  Also have a
good look round http://www.unicode.org/ which explains much of the
confusing stuff.

Donal.
-- 
Donal K. Fellows, Department of Computer Science, University of Manchester, UK.
(work) [EMAIL PROTECTED]     Tel: +44-161-275-6137  (preferred email addr.)
(home) [EMAIL PROTECTED]  Tel: +44-1274-401017   Mobile: +44-7957-298955
http://www.cs.man.ac.uk/~fellowsd/  (Don't quote my .sig; I've seen it before!)
