Jill Ramonsky posted:

I would really like it if these, and
every single other character which is "only there for reasons of round trip
compatibility" with something else, were explicitly marked in the
machine-readable charts with something meaning "Don't introduce this
character, at all, ever. Don't try to interpret it. Just preserve it, in
case it ever gets turned back to its original character set".

That would probably be too strong.


If characters are available then some people will use them. :-(

See section 2.3 at http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

Unicode 3.0 contained under section D21 on compatibility characters:

<< Their use is discouraged other than for legacy data. >>

I don't know whether this statement was intentionally removed or was accidentally dropped in the changes in 4.0 which distinguish "compatibility character" from "compatibility composite character".

In any case people can't be prevented from doing things that are officially discouraged, especially as for some particular use it might be wrong to discourage them. So if you are handling Roman numerals in an application and wish your handling to be complete then unfortunately you do have to take the compatibility Roman numerals into account.
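
As a rough sketch of what "taking them into account" might look like, assuming Python and its standard unicodedata module: the compatibility Roman numerals at U+2160..U+217F carry compatibility decompositions, so they can be folded into ordinary Latin letters before further processing. The helper name fold_roman_numerals is made up for this illustration, and folding only that one range (rather than applying NFKC to the whole string) is an assumption about what an application would want.

```python
import unicodedata


def fold_roman_numerals(text: str) -> str:
    """Replace compatibility Roman numerals (U+2160..U+217F) with Latin letters.

    Hypothetical helper for illustration; only this one range is folded,
    every other character is passed through untouched.
    """
    out = []
    for ch in text:
        if 0x2160 <= ord(ch) <= 0x217F:
            # These characters have <compat> decompositions,
            # so NFKC expands them to plain Latin letters.
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)


# U+2167 is ROMAN NUMERAL EIGHT.
print(unicodedata.decomposition("\u2167"))
print(fold_roman_numerals("Chapter \u2167"))
```

Applying NFKC to the entire string would also fold ligatures, fullwidth forms and the like, which may or may not be wanted.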

      U+2212 (minus sign) - an obvious clone of U+002D (hyphen-minus). Who
uses this?

People concerned with proper appearance of the symbol in proportional fonts. Almost all proportional fonts use a narrow hyphen dash rather than a minus-width dash for the hyphen-minus character. In some older-style fonts it is even a slanting character.


See http://www.unicode.org/versions/Unicode4.0.0/ch06.pdf in 6.2 for a detailed discussion of the various dash characters.

        U+2217 (asterisk operator) - an equally obvious clone of U+002A
(asterisk)

They look much the same in a typewriter style font. They don't do so in proportional fonts where the regular asterisk tends to appear somewhat like a superscript.


Unicode provides support both for good typographical usage and for traditional data-processing typographical usage based on typewriter technology.

U+223C (tilde operator) - a clone of U+007E (tilde)

See http://www.unicode.org/versions/Unicode4.0.0/ch07.pdf and look for "Spacing Clones of Diacritics".


The ASCII tilde was originally intended to be a non-spacing diacritic tilde to be applied to other characters by backspace. In part because of the low resolution of many early data-processing printers it was often realized in a tilde operator form. That has now become its most normal form in fonts.

But for good typography you do want to distinguish them, and the overloading of tilde as ASCII 7E means that a font may render a mathematical full-character tilde when you want to show a diacritic or render a spacing diacritic when you wanted a mathematical operator.

Unicode is intended for typesetting applications as well as entering computer code in a traditional typewriter style character set with typewriter limitations.

and then there's
U+2223 (divides) - hell, that looks to me remarkably like U+007C
(vertical line)

They do look close. But U+007C usually extends below the base line and U+2223 usually doesn't.
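
One way to check that these four operators are separate characters rather than formal compatibility clones is to look them up in the character database. The sketch below assumes Python's standard unicodedata module; the names PAIRS, describe and is_compat_clone are invented for the example. It prints the name and general category of each character and tests whether NFKC normalization maps the operator onto its ASCII look-alike.

```python
import unicodedata

# (operator, ASCII look-alike) pairs discussed above.
PAIRS = [
    ("\u2212", "\u002D"),  # minus sign vs hyphen-minus
    ("\u2217", "\u002A"),  # asterisk operator vs asterisk
    ("\u223C", "\u007E"),  # tilde operator vs tilde
    ("\u2223", "\u007C"),  # divides vs vertical line
]


def describe(ch: str) -> str:
    """Return code point, character name and general category."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"


def is_compat_clone(operator: str, ascii_char: str) -> bool:
    """True only if NFKC normalization maps operator onto ascii_char."""
    return unicodedata.normalize("NFKC", operator) == ascii_char


for operator, ascii_char in PAIRS:
    print(describe(operator), "|", describe(ascii_char),
          "| folds under NFKC:", is_compat_clone(operator, ascii_char))
```

None of the four pairs folds together under NFKC, so software that wants to treat, say, U+2212 as a minus sign when parsing numbers has to handle it explicitly.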


For example:
U+2264 (less than or equal to) - compare with U+2A7D (less than or
slanted equal to)

I have no idea. You will probably have to ask the MathML people about that one. See http://www.w3.org/TR/2001/REC-MathML2-20010221. Mathematicians seem to think they need to distinguish the two.


As a non-mathematician I find many of these distinctions bewildering and seemingly only typographical. But if mathematicians in some field make fine distinctions based on such differences then it is important that Unicode allow such distinctions to be maintained in plain text.

In defence of this argument, I point out that the
complementary relation, NOT equal to, has codepoint U+2270, and this is
represented in the code charts as having a slanted equal to, so it OUGHT to
be the complement of U+2A7D. (Unless I've missed it, there appears to be no
"not equal to with horizontal equals" character).

The chart at http://www.unicode.org/charts/PDF/U2200.pdf does not show a slanted equals.
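
The character data points the same way. As a small check, assuming Python's standard unicodedata module (the variable names are just for this example), U+2270 has a canonical decomposition into U+2264 followed by U+0338 (combining long solidus overlay), which makes it the negation of the horizontal form and not of U+2A7D:

```python
import unicodedata

not_less_or_equal = "\u2270"

# Name and canonical decomposition as recorded in the character database.
print(unicodedata.name(not_less_or_equal))
print(unicodedata.decomposition(not_less_or_equal))  # 2264 0338

# NFD spells the decomposition out as actual characters.
decomposed = unicodedata.normalize("NFD", not_less_or_equal)
print([f"U+{ord(c):04X}" for c in decomposed])

# U+2A7D, the slanted form, has no decomposition of its own.
print(unicodedata.name("\u2A7D"), repr(unicodedata.decomposition("\u2A7D")))
```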


For some discussion of the math symbols see also http://www.unicode.org/unicode/reports/tr25/tr25-5.html.

Part of the problem is that differences that are in most environments only typographical style differences may indicate semantic differences in particular disciplines. It is impossible to establish a firm line as to how important or common what would normally be a stylistic variation must be before it should be encoded in Unicode for plain text distinctions.

For example open-loop _g_ is distinguished from closed-loop _g_ in the International Phonetic Alphabet and so Unicode encodes it separately at U+0261.

A normal Latin Letter font would probably not have U+0261 in it at all and might display U+0067 with either closed or open loop. But a font for phonetic use should always display U+0067 with a closed loop.

Fonts like Arial Unicode MS lose the distinction.

For non-technical use people need not and mostly quite rightly will not use the more technical symbols to make fine distinctions that don't apply in their particular usage.

Jim Allan











