Pim Blockland posted:

Kenneth Whistler wrote:

Basically, thousands of implementations, for decades now,
have been using ASCII 0x30..0x39, 0x41..0x46, 0x61..0x66 to
implement hexadecimal numbers. That is also specified in
more than a few programming language standards and other
standards. Those characters map to Unicode U+0030..U+0039,
U+0041..U+0046, U+0061..U+0066.

That's not a good reason for deciding to not implement something in the future. If everybody thought like that, there would never have been a Unicode.

You are taking Ken's statements out of context.


Unicode did not attempt to change all of past practice, but to change parts of it and build on parts of it, balancing the apparent value of the changes against the disruption they would cause.

You have not provided a reason why the letters used as hex digits should be encoded separately for that particular use when they would make *no* difference in display.

Unicode encodes characters, not meanings, with a very few exceptions, most of them for compatibility reasons and a few for word division reasons.

Besides, your example is proof that the implementation can change;
has to change. Where applications could use 8-bit characters to
store hex digits in the old days, they now have to use 16-bit
characters to keep up with Unicode...

Are you actually arguing that because change happens, therefore any particular proposed change must be beneficial?


In any case applications still use one character for hex digits (and decimal digits) if using UTF-8. Double-byte character sets were already using two bytes for the hex digits. (Mixed-byte character sets were not.)
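As a small illustration (a sketch only, using Python 3 here), the ASCII letters and digits that serve as hex digits are still one byte each in UTF-8, and nothing beyond them is needed to read a hex number:

    s = "2AF3"
    print(len(s.encode("utf-8")))   # 4 -- one byte per hex digit in UTF-8
    print(int(s, 16))               # 10995 -- parsed with no special characters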

and Jim Allan wrote:
> What I mean is, it seems (to me) that there is a HUGE semantic difference
> between the hexadecimal digit thirteen, and the letter D.

There is also a HUGE semantic difference between D meaning the letter D and Roman numeral D meaning 500.

and those have different code points! So you're saying Jill is
right, right?

No.


You are quoting out of context from an explanation as to why Unicode coded Roman numerals separately. See 14.3 at http://www.unicode.org/versions/Unicode4.0.0/ch14.pdf:

<< Number form characters are encoded solely for compatibility with existing standards. >>

Also

<< Roman Numerals. The Roman numerals can be composed of sequences of the appropriate Latin letters. Upper- and lowercase variants of the Roman numerals through 12, plus L, C, D, and M, have been encoded for compatibility with East Asian standards. >>

These were not encoded because the Unicode people thought they would be at all useful. They aren't at all useful.

Most fonts don't support those characters, and probably most fonts never will.

There is normally no reason to use them, unless you want to spoof people, cause difficulties in searches, and have missing-character glyphs (or glyphs pulled from another font in a style different from the main font) appear when fonts are changed.

_D_ in Roman numerals is still the character _D_. People knew it was _D_ when they wrote it and knew it was _D_ when they hand-set type. They typed the _D_ key on typewriters. They typed the _D_ key on computer keyboards. And in Unicode they will mostly enter standard U+0044 LATIN CAPITAL LETTER D, quite rightly, despite a needless alternate Roman numeral _D_ in some few fonts.
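For what it is worth, the separately encoded Roman numeral _D_ even carries a compatibility decomposition back to the ordinary letter. A minimal check, assuming Python's unicodedata module:

    import unicodedata

    d = "\u216E"                              # the separately encoded character
    print(unicodedata.name(d))                # ROMAN NUMERAL FIVE HUNDRED
    print(unicodedata.normalize("NFKC", d))   # 'D' -- folds back to the ordinary letter
    print(unicodedata.numeric(d))             # 500.0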

Similarly they know that _D_ in hex notation is the letter _D_ given a special meaning in that context. Coding separately two meanings of the same character would not be helpful.

People make enough errors in entering characters even when they can see a difference.

You seem to define "meaning" differently than what we're talking
about here.
In the abbreviation "mm" the two m's have different meanings: the
first is "milli" and the second is "meter". No one is asking to
encode those two letters with different codepoints!

Why not?


It is the same kind of difference.

It is still _m_, just with a different meaning, just as the Greek character _pi_ used in geometry for the relationship between a diameter and circumference is still the character _pi_, the same as _c_ used for the speed of light in "E=mc²" is still the character _c_.

Should particular semantic meanings of characters be encoded separately just because they are arithmetical or mathematical? The distinction in use appears from the context of the usage. Encoding a new character with the same appearance would indicate nothing extra.

Computers can perform mathematics with Roman numerals or hex numbers perfectly well when they know they are Roman numerals or hex numbers without any special encoding.
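A minimal sketch of that in Python (the helper below is my own illustration, nothing standard): plain Latin letters are quite enough once the program knows it is reading a Roman numeral or a hex number.

    ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

    def roman_to_int(s):
        """Convert a Roman numeral written with ordinary letters to an integer."""
        total = 0
        for i, ch in enumerate(s):
            value = ROMAN[ch]
            # Subtractive notation: a smaller value before a larger one (e.g. IV).
            if i + 1 < len(s) and ROMAN[s[i + 1]] > value:
                total -= value
            else:
                total += value
        return total

    print(roman_to_int("MCMXCIX"))  # 1999
    print(int("D", 16))             # 13 -- the same letter read as a hex digit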

Anyone at any time in any discipline can assign a special meaning to a Latin letter without waiting for this meaning to be encoded in Unicode, and should not expect that a clone of the character with that special meaning would ever be encoded in Unicode.

What we're talking about is different general categories, different
numeric values and even, oddly enough, different BiDi categories.
Doesn't that qualify for creating new characters?

Not unless it would be *useful*. The Greek and Hebrew letters have numeric values also. Would it be useful to encode them all twice for that reason alone?
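(For reference, the property differences Pim mentions can be seen directly on the existing characters; a small sketch with Python's unicodedata module:)

    import unicodedata

    for ch in "5D":
        print(ch,
              unicodedata.category(ch),       # '5' -> Nd, 'D' -> Lu
              unicodedata.bidirectional(ch),  # '5' -> EN, 'D' -> L
              unicodedata.numeric(ch, None))  # '5' -> 5.0, 'D' -> None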


In fact we *know* that when used for numeric values they still are the *same* characters with different semantics. Unicode encodes characters.

What benefit is there in encoding a character twice when current usage seldom bothers or confuses anyone?

One might better encode a decimal-point period and a decimal-point comma separately from the normal period and comma. One might better also encode the abbreviation period separately from the sentence-ending period. We could code the right apostrophe separately from the single high closing quotation mark.

But Unicode doesn't.

The fact that in an orthographic system certain symbols have multiple and inconsistent semantics is a fault of the system, not of the encoding. Change the system (say, by demanding that every hex digit have a dot over it or that sentences end with a hollow circle) and then Unicode will have to follow suit. But as it is now, Unicode adequately codes the orthographic system in use.

And in general it is for computer systems to make things easy for the users, not more difficult by demanding that users enter special-purpose symbols that make no difference whatsoever in print or on a screen (unless one views the text in a special mode).

If a programming language needs a way to distinguish 25 hex from 25 decimal, it should be by a method that humans can also see. Note, as this example shows, that not only would you have to add duplicates of some letters of the alphabet, but also of the numeric digits. And you would presumably have to do this again for octal digits, since 10 octal is 8 decimal. Then there is binary, such as 10010.
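For illustration (Python literal syntax, as one example of what such visible methods look like), the base is signalled by notation humans can read, and the same ordinary digits and letters serve for all of them:

    print(0x19, 0o31, 0b11001, 25)      # 25 25 25 25 -- four visibly different spellings
    print(int("19", 16), int("31", 8))  # 25 25 -- same characters, base stated explicitly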

And what about base 20 if we want to count in scores?

You will need a separate set of characters for every base you want to encode. And you still won't be able to tell them apart by looking at them.

On a related note, can anybody tell me why U+212A KELVIN SIGN was
put in the Unicode character set?
I have never seen any acknowledgement of this symbol anywhere in the
real world. (That is, using U+212A instead of U+004B.)
And even the UCD calls it a letter rather than a symbol. I'd expect
that if it was put in for completeness, to complement the degree
Fahrenheit and degree Celsius signs, it would have had the same
category as those two?

U+212A comes from the KS C 5601 standard encoding for Korean, from IBM code page 944 for Korean, and possibly from some other old East Asian standard(s).


It appears to result from someone blindly including it among Roman-letter technical abbreviations in the Korean character set, even though that set already had the entire standard 26-letter Roman alphabet. So Unicode is stuck with it for compatibility.

But Unicode assigns U+212A a canonical decomposition to normal U+004B K. That means U+212A is considered to be a duplicate of normal U+004B K. See the conformance requirements in http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf, notably C9 and C10. Applications can silently replace it with U+004B and must not assume that another application will not silently replace it with U+004B.
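The practical effect of that canonical decomposition can be checked directly (a small illustration with Python's unicodedata module):

    import unicodedata

    kelvin = "\u212A"                             # KELVIN SIGN
    print(unicodedata.normalize("NFC", kelvin))   # 'K' -- folded to plain U+004B
    print(kelvin == "K")                          # False -- only the raw code points differ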

I see no point in ever using U+212A, except for spoofing, or for retaining data exactly as encoded when it has been converted from a code page that uses this character (so that it can be converted back properly and any validation checksums and such will still be valid), or so that some non-standard value for this character in a particular font will display properly.

The character U+212A within Unicode is useless.

Maybe it is time to deprecate some of these characters.

Jim Allan



