Jill Ramonsky posted:

What I mean is, it seems (to me) that there is a HUGE semantic difference
between the hexadecimal digit thirteen, and the letter D.

Yes.


There is also a HUGE semantic difference between D meaning the letter D and Roman numeral D meaning 500.

But see http://www.unicode.org/versions/Unicode4.0.0/ch14.pdf:

<< *Roman Numerals.* The Roman numerals can be composed of sequences of the appropriate Latin letters. Upper- and lowercase variants of the Roman numerals through 12, plus L, C, D, and M, have been encoded for compatibility with East Asian standards. >>

When the Unicode manual talks about anything being encoded for compatibility, it usually means that it was encoded *only* for compatibility and otherwise would probably not have been encoded in Unicode at all, because it is not needed.

Note that the chart at http://www.unicode.org/charts/PDF/U2150.pdf indicates compatibility decomposition of these characters to the regular Latin letters.
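For instance, a minimal Python sketch (using only the standard unicodedata module) shows this compatibility decomposition in action: NFKC normalization folds the dedicated Roman numeral character back to the ordinary Latin letter.

    import unicodedata

    # U+216E ROMAN NUMERAL FIVE HUNDRED is a compatibility character;
    # its NFKC normalization is the ordinary LATIN CAPITAL LETTER D.
    ch = "\u216E"
    print(unicodedata.name(ch))               # ROMAN NUMERAL FIVE HUNDRED
    print(unicodedata.normalize("NFKC", ch))  # D
    print(unicodedata.numeric(ch))            # 500.0 -- its value as a Roman numeral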

The lowercase letter _d_ is also the symbol for _deci-_ in metric abbreviations. See http://www.geocities.com/Athens/Thebes/5118/metric.htm.

_D_ also often means "digital" as in _D/A_ "digital to analog" or _D-AMPS_ "Digital Advanced Mobile Phone System".

_D_ is listed at http://www.geocities.com/malaysiastamp/info/abbreviationd.html as meaning both "document" and "Pneumatic Post. Scott catalog number prefix to identify stamps other than standard postage".

If Unicode were to distinguish some of these uses (and similar special uses for all letters in all scripts) by encoding them separately, what purpose would be served? Readers would still see only _D_ or _d_, as indeed they ought to, since that is what normal orthography and spelling call for.

Most users would not enter the new proper characters in any case. Even now most fonts don't support the special Roman numeral characters, and there is no need to support them. The standard Roman letter glyphs are what are normally used.

Unicode doesn't attempt to distinguish the meanings of symbols except when forced to by compatibility with older character sets, or in a few cases where a character that looks the same is used sometimes as a "letter" and sometimes as "punctuation", so that applications can determine the proper beginnings and endings of words.

The semantics of the symbols is otherwise not Unicode's concern. Unicode should not define whether 302D is a hex number, a product identifier, a section identifier in a document, or something else entirely. Encoding "D" with a different code won't help a reader of printed (or even displayed) text to know what is meant, and a copy typist may not know which code is intended.

Jill Ramonsky also posted:

I notice that there are Unicode properties "Hex_Digit" and "ASCII_Hex_Digit" which some Unicode characters possess. I may have missed it, but what I don't see in the charts is a mapping from the characters having these properties to the digit values that they represent. Is it assumed that the number of characters having the "Hex_Digit" property is so small that implementation is trivial? That everyone knows it? Or have I just missed the mapping by looking in the wrong place?

See http://www.unicode.org/Public/UNIDATA/PropList.txt:


<<
0030..0039 ; ASCII_Hex_Digit # Nd [10] DIGIT ZERO..DIGIT NINE
0041..0046 ; ASCII_Hex_Digit # L& [6] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER F
0061..0066 ; ASCII_Hex_Digit # L& [6] LATIN SMALL LETTER A..LATIN SMALL LETTER F


# Total code points: 22
>>

The property ASCII_Hex_Digit is a convenience that allows applications to identify one common use of "A", "B", "C", "D", "E" and "F", in accordance with the hexadecimal notation defined in some programming languages.
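As a rough sketch (in Python, and assuming only the conventional 0-9/A-F value assignment, which PropList.txt itself does not spell out), the property and the mapping the quoted question asks about amount to no more than this:

    # The 22 ASCII_Hex_Digit code points quoted above:
    # U+0030..0039, U+0041..0046 and U+0061..0066.
    ASCII_HEX_DIGITS = "0123456789ABCDEFabcdef"

    def is_ascii_hex_digit(ch):
        """True if ch is one of the 22 ASCII_Hex_Digit characters."""
        return len(ch) == 1 and ch in ASCII_HEX_DIGITS

    def hex_digit_value(ch):
        """Map an ASCII_Hex_Digit character to its conventional value 0..15."""
        if not is_ascii_hex_digit(ch):
            raise ValueError("not an ASCII_Hex_Digit: %r" % ch)
        return int(ch, 16)

    print(hex_digit_value("D"))   # 13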

In fact, when using bases greater than 16, it has also become common to extend this convention, so that one can have a number such as AW3Z₃₆ in base-36 notation.
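Python's built-in int(), to take one example, already accepts this extended convention for bases up to 36:

    # Base 36 uses the digits 0-9 followed by the letters A-Z (case-insensitive).
    print(int("AW3Z", 36))   # 508175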

To indicate hex numbers, a subscripted base indicator, a leading "&H", the word "hex", or some other explicit indicator of meaning is far more useful to humans than a double encoding of the same characters according to meaning.
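The same goes for machines: in a language where the base is stated explicitly, the characters acquire a numeric meaning only once the base is given, with no need for a second code point for "D". For example, in Python:

    # The stated base, not a separate encoding of "D", is what carries the meaning.
    print(int("302D", 16))   # 12333 when read as hexadecimal
    # int("302D", 10) raises ValueError, since "D" is not a decimal digit.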

If you can't normally see the difference in text then Unicode normally shouldn't encode any difference.

Jim Allan



