Re: Non-characters in Unicode data files

Markus Scherer Mon, 29 Dec 2003 13:34:48 -0800

Philippe Verdy wrote:

I note that the UCD contains lines for PUAs like this:
...
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
...
But why isn't there lines for the _assigned_ Private Local-Use characters in
the Arabic compatibility block, like:
...
FDD0;<Private Local-Use, First;Cn;0;L;;;;;N;;;;;
FDEF;<Private Local-Use, First;Cn;0;L;;;;;N;;;;;
...
which seem related and used only for local processing of contextual forms,
and not restricted to local rendering of Arabic ?

1. No one saw a need to include them?
2. The documentation file points out that Cn entries are not included:
  http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
3. See DerivedAge.txt which I point out below.
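
The First/Last convention and the omission of Cn entries can be illustrated with a minimal sketch. It assumes the semicolon-separated layout of the two Private Use lines quoted above; the helper names load_ranges and general_category are hypothetical, and treating every unlisted code point as Cn follows the UCD.html note cited in point 2.

```python
# Two lines in UnicodeData.txt layout, taken from the quote above.
SAMPLE = """\
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
"""


def load_ranges(text):
    """Return (start, end, general_category) tuples from UnicodeData-style lines.

    A "<..., First>" line opens a range that the following "<..., Last>"
    line closes; any other line describes a single code point.
    """
    ranges = []
    pending = None  # start of a range whose "Last" line has not been seen yet
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        cp, name, gc = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            pending = cp
        elif name.endswith(", Last>"):
            if pending is None:
                raise ValueError("Last line without a preceding First line")
            ranges.append((pending, cp, gc))
            pending = None
        else:
            ranges.append((cp, cp, gc))
    return ranges


def general_category(cp, ranges):
    """Look up a code point; anything not listed defaults to Cn."""
    for start, end, gc in ranges:
        if start <= cp <= end:
            return gc
    return "Cn"
```

With only the two sample lines loaded, U+E123 resolves to Co, while U+FDD0 falls through to the Cn default because the file has no entry for it.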

I think it is a legitimate question why the block boundaries were not adjusted to exclude this non-character range from FB50..FDFF; Arabic Presentation Forms-A (see Blocks.txt).

However, the Unicode standard only points these out as generic non-characters, not for any particular purpose like "local processing of contextual forms".

For now, even if it's specified in the text of the standard, it does not
clearly shows that these characters are assigned but invalid in all versions
of Unicode, unlike other missing code-points which may be assigned later and
should not be considered as invalid.

Unicode 3.1 (http://www.unicode.org/reports/tr27/) clarified their usage. See "3.1 Conformance Requirements (revision)" and then the heading "Noncharacters" a page or so below, including the definition D7b Noncharacter. See the equivalent parts of Unicode 4.

Noncharacters are not "invalid", but they are "designated" and therefore cannot be reassigned; see http://www.unicode.org/alloc/CurrentAllocation.html
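
For reference, the designated set is small enough to test arithmetically: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The following is a minimal sketch; the function name is_noncharacter is hypothetical and not part of any standard library.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the 66 code points designated as noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF, or a code point ending in FFFE/FFFF in any plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Enumerate them all: 32 in the BMP block plus 2 per plane.
NONCHARACTERS = [cp for cp in range(0x110000) if is_noncharacter(cp)]

# Python's own character database reports them as unassigned (Cn),
# which matches their absence from UnicodeData.txt.
CATEGORY_OF_FDD0 = unicodedata.category(chr(0xFDD0))
```

The check says nothing about whether a code point may be reassigned later; it only encodes the fixed list, which is what distinguishes these 66 from ordinary unassigned Cn code points.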

I personally find the chart under [91-C31] Consensus useful; see http://www.unicode.org/consortium/utc-minutes/UTC-091-200205.html

Other non-characters are also absent from the file (which does not contain
in fact any "Cn" characters), and I wonder why they are not listed:

> ...

See my quote above from UCD.html

I think that, if these codepoints are effectively permanently assigned as
invalid, these assignments should be listed.

Another solution would be to list these non-characters in
DerivedCoreProperties.txt

Well, they are listed in http://www.unicode.org/Public/UNIDATA/DerivedAge.txt. If you search for "noncharacter" there, you will find which ones were designated in which Unicode version. (Only two were designated in Unicode 1.)
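
The search can also be scripted. The sketch below assumes the "range ; age # comment" layout of DerivedAge.txt, with noncharacter ranges labelled in the trailing comment; the sample lines are illustrative and should be checked against the real file, and the function name noncharacter_ages is hypothetical.

```python
# Illustrative lines in DerivedAge.txt layout (verify against the real file).
SAMPLE = """\
E000..F8FF    ; 1.1 # [6400] <private-use-E000>..<private-use-F8FF>
FFFE..FFFF    ; 1.1 #   [2] <noncharacter-FFFE>..<noncharacter-FFFF>
FDD0..FDEF    ; 3.1 #  [32] <noncharacter-FDD0>..<noncharacter-FDEF>
1FFFE..1FFFF  ; 2.0 #   [2] <noncharacter-1FFFE>..<noncharacter-1FFFF>
"""


def noncharacter_ages(text):
    """Return (start, end, age) for every line whose comment mentions a noncharacter."""
    found = []
    for line in text.splitlines():
        data, _, comment = line.partition("#")
        if "noncharacter" not in comment:
            continue
        code_range, _, age = data.partition(";")
        start, _, end = code_range.strip().partition("..")
        end = end or start  # a single code point has no ".." part
        found.append((int(start, 16), int(end, 16), age.strip()))
    return found
```

Run on the sample, this keeps the three noncharacter ranges with their ages and skips the Private Use line.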

Best regards,
markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
