At 06:37 PM 2/11/02 +0000, Juliusz Chroboczek wrote:

>We, ASCII-age programmers, are used to considering plain text
>rendering as being injective up to binary identity.  We carefully
>choose fonts that distinguish between O and 0, 1 and l.  We use
>editors that warn us about non-native line ending conventions, about
>whitespace at the end of lines, about white lines at the end of files.
>
>With Unicode, doing the same becomes impossible, which some of us
>(including myself) find disorienting.  We will have to change our work
>habits, and we'll have to work out new tricks for making our software
>reliable when confronted with a non-technical user.

Very well put!

>We're desperately looking for the data needed to make our software
>user-friendly, only to learn that no such data is available.

In some cases the data is less meaningful than it appears initially,
effectively changing the statement to 'no such data can be available'.
At that point, it's time to shift your perspective one more time.

>As far as I know, there is no data that provides:
>
>   - a cross-reference of characters whose associated glyphs are
>     identical, whatever the font (applies to symbols and ``modifier
>     letters'');

Characters with general category S* (symbols) often show less variation
across fonts than other characters. I'm not sure about modifier letters...
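
For anyone who wants to check which characters fall under S*, here is a minimal sketch using Python's standard unicodedata module. The helper name is_symbol is only illustrative, and the results depend on the Unicode version your Python ships with.

```python
import unicodedata

def is_symbol(ch: str) -> bool:
    """True if ch has a general category starting with 'S' (Sm, Sc, Sk, So)."""
    return unicodedata.category(ch).startswith("S")

# A few samples: plus sign, dollar sign, copyright sign,
# MODIFIER LETTER SMALL H (a modifier letter, category Lm), and plain 'a'.
for ch in ["+", "$", "\u00a9", "\u02b0", "a"]:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} symbol={is_symbol(ch)}")
```

Note that modifier letters such as U+02B0 carry category Lm, so a test on S* alone does not cover them.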

>   - a cross-reference of characters whose associated glyphs could be
>     confused by a non-technical user;

This is one of the 'not meaningful' categories. Depending on how you 
derive it, it's either too broad, or too narrow. The shapes of many ASCII 
characters are so simple (esp. in sans-serif fonts) that they can look like 
characters from other scripts, not just Cyrillic and Greek, but Limbu, for 
example. The latter has a character that looks a bit like a Z.
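
To make the point concrete, here is a small sketch (Python, standard unicodedata module only) with three capitals that look alike in many fonts but are distinct characters. The choice of these three is mine, purely for illustration; it is not a confusability list.

```python
import unicodedata

# Three capitals that are visually near-identical in many fonts.
lookalikes = ["\u0041", "\u0391", "\u0410"]

names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X} {name}")

# Normalization does not unify them: each one survives NFKC unchanged.
unchanged = [unicodedata.normalize("NFKC", ch) == ch for ch in lookalikes]
print(unchanged)
```

Nothing in the character data itself says these three look alike; that judgment depends on the font and on the reader.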

>   - a cross-reference of characters that may, in the absence of
>     suitable fonts, be used as fallbacks for each other;

Suitable for what purpose? Just to avoid the black box? Some people have 
mapped Greek (lower case) alpha to 'a' in such a case. I find that doubtful.

>   - a map from characters to scripts;


This exists in the form of http://www.unicode.org/unicode/reports/tr24
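
As a rough sketch of how such a mapping might be consumed, the Python below parses lines in the usual "range ; value # comment" layout of Unicode data files and looks up a character. The SAMPLE text is hand-written for illustration, not a verbatim excerpt of the real data file, and the names parse_ranges and script_of are hypothetical.

```python
# Hand-written sample in the "range ; value # comment" layout.
# It is NOT a verbatim excerpt of the published data and covers only a few ranges.
SAMPLE = """\
0041..005A    ; Latin # LATIN CAPITAL LETTER A..Z
0061..007A    ; Latin # LATIN SMALL LETTER A..Z
0391..03A1    ; Greek # GREEK CAPITAL LETTER ALPHA..RHO
0410..044F    ; Cyrillic # CYRILLIC CAPITAL LETTER A..SMALL LETTER YA
"""

def parse_ranges(text):
    """Return a list of (first, last, value) tuples from data-file style text."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, value = (part.strip() for part in line.split(";"))
        first, _, last = span.partition("..")
        # A single code point has no "..", so last falls back to first.
        ranges.append((int(first, 16), int(last or first, 16), value))
    return ranges

def script_of(ch, ranges, default="Unknown"):
    """Linear lookup; fine for a sketch, a real table would use binary search."""
    cp = ord(ch)
    for first, last, value in ranges:
        if first <= cp <= last:
            return value
    return default

ranges = parse_ranges(SAMPLE)
for ch in ["A", "\u0391", "\u0410", "1"]:
    print(f"U+{ord(ch):04X} {script_of(ch, ranges)}")
```

Anything outside the sample ranges comes back as "Unknown" here, which is a property of the toy data, not of the real mapping.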

>   - a map from characters to languages.

This has been attempted for some sets of Latin-based languages. I don't
have a link to any of the documents that do that. The main problem is that
many *more* characters are actually used (and used quite commonly) by users
of these languages than are acknowledged by the makers of these lists.

>While much of this data may be deduced from the character names,
>you'll doubtless agree that many programmers would rather do something
>else than working out which characters exactly can appear in a Coptic
>context.

As I tried to hint above, this question can at best be answered in a
fuzzy, probabilistic sense. Even a statement as simple as "'e' is used for
English" can be misleading. There's at least one novel that does entirely
without that letter, but is certainly in English.

A./
