John Wiersba <[EMAIL PROTECTED]> wrote:
> Dan, I'm confused about what you mean by Code Points. What is the datatype
> of a Code Point? If it's a 32-bit integer, then I believe the Encoding is
> irrelevant, given the above list (since, by definition, a series of code
> points is a series of 32-bit integers). Otherwise, maybe what you mean
> instead of "a series of code points" is "a series of bytes (valid in the
> given Encoding)"
There are two *really* useful numbers: the amount of memory allocated
to store the data, and the number of characters (lots of apps like
that value for some reason).
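
As a rough illustration of how those two numbers diverge, here is a
minimal sketch in Python (chosen only for illustration; the sample
string and variable names are made up). The character count stays
fixed while the storage size depends on the encoding. Note that
Python's len() counts code points, which only approximates
"characters" once combining marks are involved.

```python
# Hypothetical sample text: 10 code points, two of them outside ASCII.
text = "na\u00efve caf\u00e9"

# Number of characters (strictly: code points) in the string.
char_count = len(text)

# Bytes needed to store the same text under three different encodings.
sizes = {enc: len(text.encode(enc))
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(char_count)
print(sizes)
```
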
> If I understand correctly, here are some definitions:
>
> - A Code Point is an uninterpreted 32-bit integer.
No. Code points are interpreted values, which are grouped together
into code sets (more properly, coded character sets). ASCII is
properly a code set.
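
To make "interpreted" concrete, a small sketch (Python, with codec
names taken from its standard library; using single-byte code sets
so that the stored byte and the code value coincide is a simplifying
assumption): the same number, 0xE9, names a different character
depending on which coded character set you read it against.

```python
# One raw value, read against two different coded character sets.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso8859_1")    # LATIN SMALL LETTER E WITH ACUTE
as_cyrillic = raw.decode("iso8859_5")  # CYRILLIC SMALL LETTER SHCHA

# In ASCII, the letter 'A' is assigned the value 65.
ascii_a = "A".encode("ascii")[0]

print(repr(as_latin1), repr(as_cyrillic), ascii_a)
```
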
> - An Encoding is a mapping of byte sequences to Code Points.
Yes (technically it is a bidirectional mapping...)
Note that encodings can be either nonmodal or modal. An example of a
modal encoding is JIS. An example of a nonmodal encoding is Shift-JIS.
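
A sketch of the difference, assuming "JIS" here means the
escape-sequence style that Python exposes as "iso2022_jp" (that
mapping of names is an assumption, as is the sample text). In the
modal encoding the meaning of a byte pair depends on the shift state
set by an earlier escape sequence; Shift-JIS needs no such state.

```python
ESC = b"\x1b"
text = "a\u3042a"  # 'a', HIRAGANA LETTER A, 'a'

modal = text.encode("iso2022_jp")    # switches modes with ESC sequences
nonmodal = text.encode("shift_jis")  # no mode switches at all

# Same two bytes, different result depending on the current mode.
in_kanji_mode = (ESC + b'$B' + b'$"' + ESC + b'(B').decode("iso2022_jp")
in_ascii_mode = b'$"'.decode("iso2022_jp")

print(modal, nonmodal)
print(repr(in_kanji_mode), repr(in_ascii_mode))
```

Both encodings round-trip back to the same text, which is the
bidirectional mapping mentioned above.
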
> 32-bit integers are the canonical Encoding for Code Points.
Not necessarily, but I don't know if human languages have used 4G of
symbols in their writing systems as yet... :^)
> - The Character Set defines the subset of the Code Points which are valid
> and gives them some semantics.
That's the code set (there are also uncoded character sets, but
they're not much use with computers.)
> - The Language helps further in interpreting the semantics of each Code
> Point.
> - A Glyph is a graphic representation of a Code Point in a given Character
> Set and Language and Font.
Also depends on the surrounding code points due to the business of
ligatures. :^(
> It seems to me that you can't interpret glyph without a font. For
> example, a font may not have a glyph for "ae" or "fi", so in that
> font those characters would be represented as two glyphs.
Yes, and some languages (e.g. Arabic) use a lot of ligatures.
If anyone is feeling confused by all this, I recommend chapters 2, 6,
7 and 8 of _Java Internationalization_ by Andrew Deitsch and David
Czarnecki (O'Reilly), which go into *lots* of detail. Also have a
good look round http://www.unicode.org/ which explains much of the
confusing stuff.
Donal.
--
Donal K. Fellows, Department of Computer Science, University of Manchester, UK.
(work) [EMAIL PROTECTED] Tel: +44-161-275-6137 (preferred email addr.)
(home) [EMAIL PROTECTED] Tel: +44-1274-401017 Mobile: +44-7957-298955
http://www.cs.man.ac.uk/~fellowsd/ (Don't quote my .sig; I've seen it before!)