Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Stephan Stiller



If you are concerned with computer security


If, for example, I sit on a committee that devises a new encoding form, I 
would need to be concerned with the question of which /sequences of Unicode 
code points/ are sound. If this is the same as "sequences of Unicode 
scalar values", I would need to exclude surrogates, if I read the 
standard correctly (this wasn't obvious to me on first inspection, btw). 
If, for example, I sit on a committee that designs an optimized 
compression algorithm for Unicode strings (yep, I do know about SCSU), I 
might want to first convert them to some canonical internal form (say, 
my array of non-negative integers). If U+D800 .. U+DFFF can be 
assumed not to exist, there are 2048 fewer values a code point can 
assume; that's good for compression, and I'll subtract 2048 from those 
large scalar values in a first step. Etc etc. So I do think there are a 
number of very general use cases where this question arises.
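
A minimal sketch of that first step, in Java (the class and method 
names are illustrative, not from any standard API), assuming the input 
is already known to consist of scalar values:

    public final class ScalarCompaction {
        static final int SURROGATE_START = 0xD800;
        static final int SURROGATE_COUNT = 0x800; // 2048 values: U+D800..U+DFFF

        // Maps U+0000..U+D7FF to itself and U+E000..U+10FFFF down by 2048,
        // yielding a dense range 0..0x10F7FF (1,112,064 values).
        static int toDenseIndex(int scalarValue) {
            if (scalarValue >= SURROGATE_START && scalarValue < 0xE000) {
                throw new IllegalArgumentException("surrogate: " + scalarValue);
            }
            return scalarValue < SURROGATE_START ? scalarValue
                                                 : scalarValue - SURROGATE_COUNT;
        }

        // Inverse mapping, for decompression.
        static int fromDenseIndex(int index) {
            return index < SURROGATE_START ? index : index + SURROGATE_COUNT;
        }
    }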



For example, the original C datatype named "string", as it is
understood and manipulated by the C standard library, has an
/absolute/ prohibition against U+0000 anywhere inside.


That's not so much a prohibition as an artifact of NUL-termination of 
strings. In more modern libraries, the string contents and the explicit 
length are stored together, and you can store a 00 byte just fine, for 
example in a C++ string.


Yep.

If my question is really underspecified or ill-formed, a listing of 
possible interpretations somewhere (with case-specific answers) might be 
useful.


Stephan



Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Markus Scherer
On Fri, Jan 4, 2013 at 6:08 PM, Stephan Stiller
wrote:

> Is there a most general sense in which there are constraints beyond all
> characters being from within the range U+0000 ... U+10FFFF? If one is
> concerned with computer security, oddities that are absolute should raise a
> flag; somebody could be messing with my system.
>

If you are concerned with computer security, then I suggest you read
http://www.unicode.org/reports/tr36/ "Unicode Security Considerations".

For example, the original C datatype named "string", as it is understood
> and manipulated by the C standard library, has an *absolute* prohibition
> against U+0000 anywhere inside.
>

That's not so much a prohibition as an artifact of NUL-termination of
strings. In more modern libraries, the string contents and the explicit
length are stored together, and you can store a 00 byte just fine, for
example in a C++ string.

markus


Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Stephan Stiller

Thanks for all the information.

Is there a most general sense in which there are constraints beyond all 
characters being from within the range U+0000 ... U+10FFFF? If one is 
concerned with computer security, oddities that are absolute should 
raise a flag; somebody could be messing with my system. Perhaps, for 
internal purposes, I have stored my Unicode string in an array of 
non-negative integers, and now I'm passing around this array. I don't 
know anything else about that string besides it being a Unicode string. 
There are no /absolute/ constraints against having any of those 
1114112_dec (110000_hex) code points appearing anywhere, correct? Oh 
wait, actually there are the surrogates (D800 ... DFFF); perhaps I need 
to exclude them. So what else might I have overlooked? For example, the 
original C datatype named "string", as it is understood and manipulated 
by the C standard library, has an /absolute/ prohibition against U+0000 
anywhere inside. UTF-32 has an /absolute/ prohibition against anything 
above 10FFFF. UTF-16 has an /absolute/ prohibition against broken 
surrogate pairs. (Or so is my understanding. Mark Davis mentioned 
"Unicode X-bit strings", but D76 (in sec. 3.9 of the standard) suggests 
that there is no place for surrogate values outside of an encoding form; 
that is: a surrogate is not a "Unicode scalar value". Perhaps "Unicode 
X-bit string" should be outside of this discussion then, or I'll need to 
read up on this more.)
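
A minimal sketch, in Java, of the check described above for the 
array-of-integers representation (illustrative names; the only absolute 
constraints are the codespace bound and the surrogate gap):

    public class ScalarCheck {
        static boolean isScalarValueSequence(int[] codePoints) {
            for (int cp : codePoints) {
                if (cp < 0 || cp > 0x10FFFF) return false;      // outside the codespace
                if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate code points
            }
            return true; // noncharacters such as U+FFFF pass: no absolute ban
        }

        public static void main(String[] args) {
            System.out.println(isScalarValueSequence(new int[] {0x61, 0xFFFF, 0x62})); // true
            System.out.println(isScalarValueSequence(new int[] {0xD800}));             // false
        }
    }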


Mark Davis' quote ("In effect, noncharacters can be thought of as 
application-internal private-use code points.") would suggest that 
there are really no absolute constraints. I'm just checking that my 
understanding of the matter is correct.


Stephan



Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Mark Davis ☕
To assess whether a string is invalid, it all depends on what the string is
supposed to be.

1. As Ken says, if a string is supposed to be in a given encoding form
(UTF), but it consists of an ill-formed sequence of code units for that
encoding form, it would be invalid. So an isolated surrogate (e.g. 0xD800) in
UTF-16 or any surrogate (e.g. 0xD800) in UTF-32 would make the string
invalid. For example, a Java String may be an invalid UTF-16 string. See
http://www.unicode.org/glossary/#unicode_encoding_form

2. However, a "Unicode X-bit string" does not have the same restrictions:
it may contain sequences that would be ill-formed in the corresponding UTF-X
encoding form. So a Java String is always a valid Unicode 16-bit string.
See http://www.unicode.org/glossary/#unicode_string

3. Noncharacters are also valid in interchange, depending on the sense of
"interchange". The TUS says ""In effect, noncharacters can be thought of as
application-internal private-use code points." If I couldn't interchange
them ever, even internal to my application, or between different modules
that compose my application, they'd be pointless. They are, however,
strongly discouraged in *public* interchange. The glossary entry and some
of the standard text is a bit old here, and needs to be clarified.

4. The quotation "we select a substring that begins with a combining
character, this new string will not be a valid string in Unicode." is
wrong. It *is* a valid Unicode string. It isn't particularly useful in
isolation, but it is valid. For some *specific purpose*, any particular
string might be invalid. For example, the string mark#d might be invalid in
some systems as a password, where # is disallowed, or where passwords might
be required to be 8 characters long.
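
A small Java sketch of the distinction between points 1 and 2 
(illustrative name; every Java String is a valid Unicode 16-bit string, 
but this reports whether it is also well-formed UTF-16):

    public class Utf16Check {
        // Returns false exactly when the String contains an unpaired surrogate.
        static boolean isWellFormedUtf16(String s) {
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (Character.isHighSurrogate(c)) {
                    if (i + 1 == s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                        return false; // high surrogate not followed by a low one
                    }
                    i++; // skip the paired low surrogate
                } else if (Character.isLowSurrogate(c)) {
                    return false;     // low surrogate with no preceding high one
                }
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(isWellFormedUtf16("a\uD800"));       // false (point 1)
            System.out.println(isWellFormedUtf16("a\uD800\uDC00")); // true: a valid pair
        }
    }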




Mark
— Il meglio è l’inimico del bene — (The best is the enemy of the good)


On Fri, Jan 4, 2013 at 3:10 PM, Stephan Stiller
wrote:

>
>  A Unicode string in UTF-8 encoding form could be ill-formed if the bytes
>> don't follow the specification for UTF-8, for example.
>>
> Given that answer, add "in UTF-32" to my email just now, for simplicity's
> sake. Or let's simply assume we're dealing with some sort of sequence of
> abstract integers from hex 0 to hex 10FFFF, to abstract away from "encoding
> form" issues.
>
> Stephan
>
>
>


RE: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Whistler, Ken
One of the reasons why the Unicode Standard avoids the term “valid string” is 
that it immediately begs the question: valid *for what*?

The Unicode string <0061, FFFF, 0062> is just a sequence of 3 Unicode 
characters. It is valid *for* use in internal processing, because for my own 
processing I can decide I need to use the noncharacter value U+FFFF for some 
internal sentinel (or whatever). It is not, however, valid *for* open 
interchange, because there is no conformant way by the standard (by design) for 
me to communicate to you how to interpret U+FFFF in that string. However, the 
string <0061, FFFF, 0062> is valid *as* an NFC-normalized Unicode string, 
because the normalization algorithm must correctly process all Unicode code 
points, including noncharacters.
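
A hedged illustration, assuming ICU4J and its Normalizer2 API: NFC must 
process every code point, so a string containing the noncharacter 
U+FFFF passes through unchanged.

    import com.ibm.icu.text.Normalizer2;

    public class NoncharNfc {
        public static void main(String[] args) {
            Normalizer2 nfc = Normalizer2.getNFCInstance();
            String s = "a\uFFFFb"; // <0061, FFFF, 0062>
            System.out.println(nfc.isNormalized(s));        // true
            System.out.println(nfc.normalize(s).equals(s)); // true: unchanged
        }
    }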

The Unicode string <0061, E000, 0062> contains a private use character 
U+E000. That is valid *for* open interchange, but it is not interpretable 
according to the standard itself. It requires an external agreement as to the 
interpretation of U+E000.

The Unicode string <0061, 002A, 0062> (“a*b”) is not valid *as* an 
identifier, because it contains a pattern-syntax character, the asterisk. 
However, it is certainly valid *for* use as an expression, for example.

And so on up the chain of potential uses to which a Unicode string could be put.

People (and particularly programmers) should not get too hung up on the notion 
of validity of a Unicode string, IMO. It is not some absolute kind of condition 
which should be tested in code with a bunch of assert() conditions every time a 
string hits an API. That way lies bad implementations of bad code. ;-)

Essentially, most Unicode string handling APIs just pass through string 
pointers (or string objects) the same way old ASCII-based programs passed 
around ASCII strings. Checks for “validity” are only done at points where they 
make sense, and where the context is available for determining what the 
conditions for validity actually are. For example, a character set conversion 
API absolutely should check for ill-formed UTF-8 and have appropriate 
error-handling, as well as check for uninterpretable conversions (mappings not 
in the table), again with appropriate error-handling.

But, on the other hand, an API which converts Unicode strings between UTF-8 and 
UTF-16, for example, absolutely should not – must not – concern itself with the 
presence of a defective combining character sequence. If it doesn’t convert the 
defective combining character sequence in UTF-8 into the corresponding 
defective combining character sequence in UTF-16, then the API is just broken. 
Never mind the fact that the defective combining character sequence itself 
might not then be valid *for* some other operation, say a display algorithm 
which detects that as an unacceptable edge condition and inserts a virtual base 
for the combining mark in order not to break the display.
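
A minimal sketch of that kind of checking, using the JDK's charset 
decoder (an assumption for illustration; any conversion API with error 
reporting would do). REPORT turns ill-formed UTF-8 into an error 
instead of a silent replacement:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class StrictUtf8 {
        static String decodeUtf8Strict(byte[] bytes) throws CharacterCodingException {
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            return dec.decode(ByteBuffer.wrap(bytes)).toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(decodeUtf8Strict(new byte[] {0x61})); // "a"
            decodeUtf8Strict(new byte[] {(byte) 0xC0, (byte) 0x80}); // throws: overlong form
        }
    }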

--Ken






Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Stephan Stiller



A Unicode string in UTF-8 encoding form could be ill-formed if the bytes don't 
follow the specification for UTF-8, for example.
Given that answer, add "in UTF-32" to my email just now, for 
simplicity's sake. Or let's simply assume we're dealing with some sort 
of sequence of abstract integers from hex 0 to hex 10FFFF, to abstract 
away from "encoding form" issues.


Stephan




Re: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Stephan Stiller



What does it mean to not be a valid string in Unicode?


Is there a concise answer in one place? For example, if one uses the 
noncharacters just mentioned by Ken Whistler ("intended for 
process-internal uses, but [...] not permitted for interchange"), what 
precisely does that mean? /Naively/, all strings over the alphabet 
{U+0000, ..., U+10FFFF} seem "valid", but section 16.7 clarifies that 
noncharacters are "forbidden for use in open interchange of Unicode text 
data". I'm assuming there is a set of isValidString(...)-type ICU calls 
that deals with this? Yes, I'm sure this has been asked before and ICU 
documentation has an answer, but this page

http://www.unicode.org/faq/utf_bom.html
contains lots of distributed factlets where it's imo unclear how to add 
them up. An implementation can use characters that are "invalid in 
interchange", but I wouldn't expect implementation-internal aspects of 
anything to be subject to any standard in the first place (so, why write 
this?). Also it makes me wonder about the runtime of the algorithm 
checking for valid Unicode strings of a particular length. Of course the 
answer is "linear" complexity-wise, but as it or a variation of it 
(depending on how one treats holes and noncharacters) will be dependent 
on the positioning of those special characters, how fast does this 
function perform in practice? This also relates to Markus Scherer's 
reply to the "holes" thread just now.


Stephan



RE: What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Whistler, Ken
Yannis' use of the terminology "not ... a valid string in Unicode" is a little 
confusing there.

A Unicode string with the sequence, say, <0300, 0061> (a combining grave 
mark, followed by "a"), is "valid" Unicode in the sense that it just consists 
of two Unicode characters in a sequence. It is aberrant, certainly, but the way 
to describe that aberrancy is that the string starts with a defective combining 
character sequence (a combining mark, with no base character to apply to). And 
it would be non-conformant to the standard to claim that that sequence actually 
represented (or was equivalent to) the Latin small letter a-grave ("à").

There is a second potential issue, which is whether any particular Unicode 
string is "ill-formed" or not. That issue comes up when examining actual code 
units laid out in memory in a particular encoding form. A Unicode string in 
UTF-8 encoding form could be ill-formed if the bytes don't follow the 
specification for UTF-8, for example. That is a separate issue from whether the 
string starts with a defective combining character sequence.

For "defective combining character sequence", see D57 in the standard. (p. 81)

For "ill-formed", see D84 in the standard. (p. 91)

http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf
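
A hedged Java sketch of the D57 condition (an approximation: it treats 
general categories Mn, Mc, and Me as combining characters):

    public class DefectiveStart {
        // True if the string begins with a combining mark rather than a base
        // character, i.e. starts a defective combining character sequence.
        static boolean startsDefective(String s) {
            if (s.isEmpty()) return false;
            int type = Character.getType(s.codePointAt(0));
            return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
        }

        public static void main(String[] args) {
            System.out.println(startsDefective("\u0300a")); // true: <0300, 0061>
            System.out.println(startsDefective("a\u0300")); // false: base comes first
        }
    }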

--Ken

> In the book, Fonts & Encodings (p. 61, first paragraph) it says:
> 
> ... we select a substring that begins
> with a combining character, this new
> string will not be a valid string in
>  Unicode.
> 
> What does it mean to not be a valid string in Unicode?
> 
> /Roger
> 





Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Markus Scherer
On Fri, Jan 4, 2013 at 5:32 AM, Stephan Stiller
wrote:

>
>  There's no distinction between "holes" and other unassigned characters.
>>
> Good to know. This might be important knowledge for people using block
> ranges loosely for algorithms that deal with Unicode text.


It is sometimes useful to design very low-level data structures or
algorithms in alignment with some "block" boundaries, but they are really
just artifacts of previous character allocations, and for nearly all text
processing the blocks are not useful. Please use the Script and other
properties for text processing.
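
A small illustration, assuming Java 7+ and its Character.UnicodeScript: 
U+0951 sits in the Devanagari block, but its Script property is 
Inherited, since it is used with several scripts.

    public class ScriptVsBlock {
        public static void main(String[] args) {
            int cp = 0x0951; // DEVANAGARI STRESS SIGN UDATTA
            System.out.println(Character.UnicodeBlock.of(cp));  // DEVANAGARI (the block)
            System.out.println(Character.UnicodeScript.of(cp)); // INHERITED (the script)
        }
    }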

markus


What does it mean to "not be a valid string in Unicode"?

2013-01-04 Thread Costello, Roger L.
Hi Folks,

In the book, Fonts & Encodings (p. 61, first paragraph) it says:

... we select a substring that begins
with a combining character, this new
string will not be a valid string in
 Unicode.

What does it mean to not be a valid string in Unicode?

/Roger




Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Mark Davis ☕
http://www.unicode.org/alloc/CurrentAllocaiton.html
=>
http://www.unicode.org/alloc/CurrentAllocation.html


Mark
— Il meglio è l’inimico del bene — (The best is the enemy of the good)


On Fri, Jan 4, 2013 at 10:24 AM, Whistler, Ken  wrote:

> Stephan Stiller continued:
>
> > Occasionally the question is asked how many characters Unicode has. This
> > question has an answer in section D.1 of the Unicode Standard. I
> > suspect, however, that once in a while the motivation for asking this
> > question is to find out how much of Unicode has been "used up". As
> > filling in holes would be dispreferred, it might be interesting to know
> > how much of Unicode has been filled if one counts partially filled
> > blocks as full. I have no reason to disagree with the (stated and
> > reiterated) opinion that our codespace won't be used up in the
> > foreseeable future, but it's simply a fun question to ask.
> >
>
> The editors maintain some statistical information relevant to this fun
> question at:
>
> http://www.unicode.org/alloc/CurrentAllocaiton.html
>
> Feel free to reference those fun facts the next time Unicode comes up in
> conversation at the bar. ;-)
>
> There have been a few notable examples where particularly egregious
> examples of holes in blocks that seemed unlikely to be filled with like
> material in the future were "reprogrammed" as it were, and grabbed for the
> encoding of unrelated sets of characters. The most notable of these is the
> range U+FDD0..U+FDEF in the middle of the Arabic Presentation Forms-A
> block. There was a clear consensus in both committees that nobody wanted to
> add any more encodings for presentation forms of Arabic ligatures. So, when
> a need arose to add another range of noncharacters, the UTC simply decided
> that the otherwise unused range U+FDD0..U+FDEF could serve for that, while
> not requiring the addition of a new two-column block that could otherwise
> be used on the BMP. There are several symbol blocks on the BMP which have
> also had a somewhat colorful and creative history of "hole-filling" over
> time.
>
> --Ken
>
>
>
>


Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Michael Everson
On 4 Jan 2013, at 18:24, "Whistler, Ken"  wrote:

> There have been a few notable examples where particularly egregious examples 
> of holes in blocks that seemed unlikely to be filled with like material in 
> the future were "reprogrammed" as it were, and grabbed for the encoding of 
> unrelated sets of characters. The most notable of these is the range 
> U+FDD0..U+FDEF in the middle of the Arabic Presentation Forms-A block.

Another example is the Greek block, where many holes were there just because of 
the shape of 8859-7. These have been filled up with Greek stuff unrelated to 
8859.  

Michael Everson * http://www.evertype.com/





RE: holes (unassigned code points) in the code charts

2013-01-04 Thread Whistler, Ken
Whoops!

http://www.unicode.org/alloc/CurrentAllocation.html

--Ken

> The editors maintain some statistical information relevant to this fun 
> question
> at:
> 
> http://www.unicode.org/alloc/CurrentAllocaiton.html





RE: holes (unassigned code points) in the code charts

2013-01-04 Thread Whistler, Ken
Stephan Stiller continued:

> Occasionally the question is asked how many characters Unicode has. This
> question has an answer in section D.1 of the Unicode Standard. I
> suspect, however, that once in a while the motivation for asking this
> question is to find out how much of Unicode has been "used up". As
> filling in holes would be dispreferred, it might be interesting to know
> how much of Unicode has been filled if one counts partially filled
> blocks as full. I have no reason to disagree with the (stated and
> reiterated) opinion that our codespace won't be used up in the
> foreseeable future, but it's simply a fun question to ask.
> 

The editors maintain some statistical information relevant to this fun question 
at:

http://www.unicode.org/alloc/CurrentAllocaiton.html

Feel free to reference those fun facts the next time Unicode comes up in 
conversation at the bar. ;-)

There have been a few notable examples where particularly egregious examples of 
holes in blocks that seemed unlikely to be filled with like material in the 
future were "reprogrammed" as it were, and grabbed for the encoding of 
unrelated sets of characters. The most notable of these is the range 
U+FDD0..U+FDEF in the middle of the Arabic Presentation Forms-A block. There 
was a clear consensus in both committees that nobody wanted to add any more 
encodings for presentation forms of Arabic ligatures. So, when a need arose to 
add another range of noncharacters, the UTC simply decided that the otherwise 
unused range U+FDD0..U+FDEF could serve for that, while not requiring the 
addition of a new two-column block that could otherwise be used on the BMP. 
There are several symbol blocks on the BMP which have also had a somewhat 
colorful and creative history of "hole-filling" over time.

--Ken





Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Philippe Verdy
2013/1/4 Asmus Freytag :
> On 1/4/2013 2:36 AM, Stephan Stiller wrote:
>>
>> All,
>>
>> There are plenty of unassigned code points within blocks that are in use;
>> these often come at the end of a block but there are plenty of holes as
>> well.
>>
>> I have a cluster of interrelated questions:
>> 1. What sorts of reasons are there (or have there been) for leaving holes?
>> Code page conversion and changes to casing by simple arithmetic? What else?
>
>
> There are a number of reasons why a code chart may not be contiguous besides
> the reason you give. Sometimes, a character gets removed from the draft at
> last minute, In those cases, a hole may be left. In general, the possible
> reasons for leaving a hole can not be enumerated in a fixed list. It's more
> of a case-by-case thing.

And sometimes holes are left pending a further decision: a position
remains reserved for a while, as long as the proposed character has not
been formally rejected. Sometimes holes come from simple mappings from
legacy encodings, kept just to preserve the relative order; the holes
were not allocated because the legacy encoding referenced a character
already encoded elsewhere.

These holes, initially kept to preserve compatibility with simple
mappings of legacy encodings and with some fonts, may be left empty for
a long time (even though the font assignments are normally invalid; this
is the case in the block of Wingdings symbols). For normal scripts
(alphabets, abjads, alphasyllabaries, sinograms, ideographs), they may
be allocated later for completely unrelated new characters in the same
script (as long as there's evidence that the script will likely
include more historic characters in the future; this is the case for
Latin, Arabic, Cyrillic, and many Indic scripts, and for blocks
containing punctuation, mathematical symbols, and pictograms such as
emoji or game symbols like playing cards).

As long as a single proposal can fit in the existing holes of existing
blocks, no new block will be allocated; but if a proposal contains more
characters than can fit in a hole, a new block will be allocated to fit
them all at once (allowing new fonts to support all of them at once,
without having to update many fonts for full coverage of the accepted
proposal, thus simplifying implementation, deployment, and usage). Many
proposals consist of just a single character or very few characters;
slowly these will fill the holes left in blocks by prior assignments.

I think that the rationale is to allow grouping together characters
that will be used together and in the same fonts (notably if there are
contextual substitution rules or ligatures).

Just look at the history of Unicode versions in the Extended Latin
blocks, and you'll find these later allocations filling holes left by
prior assignments. The roadmap also reveals some info about the
estimated number of characters for which there are pending proposals.
Very often these proposals reference such holes, but they will not be
concluded for a long time, and they must avoid colliding with each
other, competing for the same positions after the initial encoding
steps have been passed but not finalized, or being abandoned completely
in favor of a newer, more complete proposal. Many proposals will take
months or years to be completed, even if their blocks are already
accepted and already encode a small part of the needed characters.



Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Stephan Stiller


It's generally not desirable, but there's no firm policy that blocks 
must have a single script value (and in fact, no such restriction 
exists in existing blocks).


If strong technical reasons exist for placing a character into the 
BMP, there will be temptation to fill a "hole" if the BMP is otherwise 
full. Likewise, many, many years (decades) from now, similar pressure 
might exist should the rest of the code space become filled.


I just noticed that filling in holes with characters that don't 
conceptually fit into the block would necessitate a different 
presentation of the blocks (by "presentation" I mean how the code charts 
are presented). Currently blocks have descriptive names. If the 
descriptions no longer fit, things would become messy.


Stephan




Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Stephan Stiller



There's no distinction between "holes" and other unassigned characters.
Good to know. This might be important knowledge for people using block 
ranges loosely for algorithms that deal with Unicode text.
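
A small Java demonstration of such a "hole", assuming a reasonably 
current JDK: U+0378 lies inside the Greek and Coptic block but is, as 
of this writing, unassigned.

    public class HoleDemo {
        public static void main(String[] args) {
            int hole = 0x0378;
            System.out.println(Character.UnicodeBlock.of(hole)); // GREEK (the block exists)
            System.out.println(Character.isDefined(hole));       // false: unassigned
        }
    }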


2.2 If yes, how does the number of assigned code points differ, if 
holes that are assumed to be filled only by certain types of 
characters are counted?


???


Occasionally the question is asked how many characters Unicode has. This 
question has an answer in section D.1 of the Unicode Standard. I 
suspect, however, that once in a while the motivation for asking this 
question is to find out how much of Unicode has been "used up". As 
filling in holes would be dispreferred, it might be interesting to know 
how much of Unicode has been filled if one counts partially filled 
blocks as full. I have no reason to disagree with the (stated and 
reiterated) opinion that our codespace won't be used up in the 
foreseeable future, but it's simply a fun question to ask.



Unicode doesn't make mistakes. :)

Maybe I should change my legal name to "Unicode".

Stephan




Re: holes (unassigned code points) in the code charts

2013-01-04 Thread Asmus Freytag

On 1/4/2013 2:36 AM, Stephan Stiller wrote:

All,

There are plenty of unassigned code points within blocks that are in 
use; these often come at the end of a block but there are plenty of 
holes as well.


I have a cluster of interrelated questions:
1. What sorts of reasons are there (or have there been) for leaving 
holes? Code page conversion and changes to casing by simple 
arithmetic? What else?


There are a number of reasons why a code chart may not be contiguous 
besides the reason you give. Sometimes, a character gets removed from 
the draft at the last minute; in those cases, a hole may be left. In 
general, the possible reasons for leaving a hole cannot be enumerated 
in a fixed list. It's more of a case-by-case thing.
1.1 The rationale for particular holes is not documented in the code 
charts I looked at; is there documentation? (Yes, in some instances 
the answer can be guessed.)


In general, no. Sometimes, there's explanation in the text.
1.2 How is the number of holes determined? It seems like multiples of 
16 are used for block sizes merely for practical reasons.

Blocks end on a value ending in "F" in hexadecimal notation.
2. I notice that ranges are often used to describe where scripts are 
found. Do holes have properties? Are there other block-related policies 
that give holes a certain semantics?


There are default values for some properties that can be applied to 
unassigned characters in order to make an algorithm "do the best" with 
as-yet-unassigned characters (so that if a new character is created, the 
algorithm doesn't have to be reimplemented necessarily but still gives 
good results).


There's no distinction between "holes" and other unassigned characters.
2.1 If not, how likely is it that Unicode assigns script-external 
characters to holes?


It's generally not desirable, but there's no firm policy that blocks 
must have a single script value (and in fact, no such restriction exists 
in existing blocks).
2.2 If yes, how does the number of assigned code points differ, if 
holes that are assumed to be filled only by certain types of 
characters are counted?


???
2.2.1 Would this make much of a difference wrt the question (this 
comes up from time to time it seems) of how much of Unicode will 
eventually fill up?


If strong technical reasons exist for placing a character into the BMP, 
there will be temptation to fill a "hole" if the BMP is otherwise full. 
Likewise, many, many years (decades) from now, similar pressure might 
exist should the rest of the code space become filled.


However, the most likely scenario is that Unicode will continue for an 
indefinite period with sufficient "open" space (and the occasional hole).

3. Have there been "mistakes" wrt hole assignment?


Unicode doesn't make mistakes. :)

A.


Stephan








holes (unassigned code points) in the code charts

2013-01-04 Thread Stephan Stiller

All,

There are plenty of unassigned code points within blocks that are in 
use; these often come at the end of a block but there are plenty of 
holes as well.


I have a cluster of interrelated questions:
1. What sorts of reasons are there (or have there been) for leaving 
holes? Code page conversion and changes to casing by simple arithmetic? 
What else?
1.1 The rationale for particular holes is not documented in the code 
charts I looked at; is there documentation? (Yes, in some instances the 
answer can be guessed.)
1.2 How is the number of holes determined? It seems like multiples of 16 
are used for block sizes merely for practical reasons.
2. I notice that ranges are often used to describe where scripts are 
found. Do holes have properties? Are there other block-related policies 
that give holes a certain semantics?
2.1 If not, how likely is it that Unicode assigns script-external 
characters to holes?
2.2 If yes, how does the number of assigned code points differ, if holes 
that are assumed to be filled only by certain types of characters are 
counted?
2.2.1 Would this make much of a difference wrt the question (this comes 
up from time to time it seems) of how much of Unicode will eventually 
fill up?

3. Have there been "mistakes" wrt hole assignment?

Stephan