On 1/4/2013 2:36 AM, Stephan Stiller wrote:
All,
There are plenty of unassigned code points within blocks that are in
use; these often come at the end of a block but there are plenty of
holes as well.
I have a cluster of interrelated questions:
1. What sorts of reasons are there (or have there been) for leaving
holes? Code page conversion and changes to casing by simple
arithmetic? What else?
There are a number of reasons why a code chart may not be contiguous
besides the reasons you give. Sometimes a character gets removed from
the draft at the last minute. In those cases, a hole may be left. In
general, the possible reasons for leaving a hole cannot be enumerated
in a fixed list. It's more of a case-by-case thing.
1.1 The rationale for particular holes is not documented in the code
charts I looked at; is there documentation? (Yes, in some instances
the answer can be guessed.)
In general, no. Sometimes there's an explanation in the text.
1.2 How is the number of holes determined? It seems like multiples of
16 are used for block sizes merely for practical reasons.
Blocks end on a value ending in "F" in hexadecimal notation.
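
To make the alignment concrete, here is a minimal Python sketch. The helper
name is_block_aligned is made up for illustration, and the check that a block
also starts on a multiple of 16 is my addition, consistent with the ranges
listed; the three block ranges are taken from the published code charts.

```python
def is_block_aligned(start: int, end: int) -> bool:
    """Return True if the range starts on a multiple of 16 and ends on xxxF.

    Hypothetical helper for illustration only; not part of any Unicode API.
    """
    return start % 16 == 0 and end % 16 == 0xF


# A few block ranges as they appear in the code charts.
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Greek and Coptic": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
}

for name, (start, end) in BLOCKS.items():
    aligned = is_block_aligned(start, end)
    print(f"{name}: U+{start:04X}..U+{end:04X} aligned={aligned}")
```

A range such as 0370..03FE would fail the check, since it does not end on a
value ending in "F".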
2. I notice that ranges are often used to describe where scripts are
found. Do holes have properties? Are there other block-related policies
that give holes certain semantics?
There are default values for some properties that can be applied to
unassigned characters in order to make an algorithm "do the best" with
as-yet-unassigned characters (so that when a new character is encoded, the
algorithm doesn't necessarily have to be reimplemented and still gives
good results).
There's no distinction between "holes" and other unassigned characters.
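
As a rough illustration, the sketch below lists the unassigned code points
inside one block using Python's standard unicodedata module, which reports
the General_Category "Cn" for any unassigned code point, hole or not. The
function name unassigned_in_range is made up for this example, and the exact
list printed depends on the Unicode version bundled with your Python, so take
it as a sketch rather than a definitive tool.

```python
import unicodedata


def unassigned_in_range(start: int, end: int) -> list[int]:
    """Return the code points in start..end (inclusive) with category Cn.

    Hypothetical helper; "Cn" is the value unicodedata.category() returns
    for unassigned code points.
    """
    return [
        cp
        for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)) == "Cn"
    ]


# Greek and Coptic block, U+0370..U+03FF.
holes = unassigned_in_range(0x0370, 0x03FF)
print([f"U+{cp:04X}" for cp in holes])
```

U+03A2 should show up in the output: it sits where an uppercase counterpart of
final sigma (U+03C2) would fall, which is an instance of the "casing by simple
arithmetic" reason mentioned in question 1.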
2.1 If not, how likely is it that Unicode assigns script-external
characters to holes?
It's generally not desirable, but there's no firm policy that blocks
must have a single script value (and in fact, no such restriction exists
in existing blocks).
2.2 If yes, how does the number of assigned code points differ, if
holes that are assumed to be filled only by certain types of
characters are counted?
???
2.2.1 Would this make much of a difference wrt the question (this
comes up from time to time it seems) of how much of Unicode will
eventually fill up?
If strong technical reasons exist for placing a character into the BMP,
there will be temptation to fill a "hole" if the BMP is otherwise full.
Likewise, many, many years (decades) from now, similar pressure might
exist should the rest of the code space become filled.
However, the most likely scenario is that Unicode will continue for an
indefinite period with sufficient "open" space (and the occasional hole).
3. Have there been "mistakes" wrt hole assignment?
Unicode doesn't make mistakes. :)
A,.
Stephan