At 06:37 PM 2/11/02 +0000, Juliusz Chroboczek wrote:
>We, ASCII-age programmers, are used to considering plain text
>rendering as being injective up to binary identity. We carefully
>choose fonts that distinguish between O and 0, 1 and l. We use
>editors that warn us about non-native line ending conventions, about
>whitespace at the end of lines, about white lines at the end of files.
>
>With Unicode, doing the same becomes impossible, which some of us
>(including myself) find disorienting. We will have to change our work
>habits, and we'll have to work out new tricks for making our software
>reliable when confronted with a non-technical user.

Very well put!

>We're desperately looking for the data needed to make our software
>user-friendly, only to learn that no such data is available.

In some cases the data is less meaningful than it initially appears,
which effectively changes the statement to 'no such data can be
available'. At that point, it's time to shift your perspective one more
time.

>As far as I know, there is no data that provides:
>
> - a cross-reference of characters whose associated glyphs are
>   identical, whatever the font (applies to symbols and ``modifier
>   letters'');

Characters with general category values S* often show less variation
across fonts than other characters. I'm not sure about modifier
letters...

> - a cross-reference of characters whose associated glyphs could be
>   confused by a non-technical user;

This is one of the 'not meaningful' categories. Depending on how you
derive it, it's either too broad or too narrow. The shapes of many
ASCII characters are so simple (esp. in sans-serif fonts) that they can
look like characters from other scripts: not just Cyrillic and Greek,
but Limbu, for example, which has a character that looks a bit like
a Z.

> - a cross-reference of characters that may, in the absence of
>   suitable fonts, be used as fallbacks for each other;

Suitable for what purpose? Just to avoid the black box? Some people
have mapped Greek (lower case) alpha to 'a' in such a case. Doubtful.

> - a map from characters to scripts;

This exists in the form of
http://www.unicode.org/unicode/reports/tr24

> - a map from characters to languages.

This has been attempted for some sets of Latin-based languages. I don't
have a link to one of the documents that do that. The main problem is
that users of these languages actually use many *more* characters (and
use them quite commonly) than the makers of these lists acknowledge.

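
To make the points about general categories, look-alike characters and
scripts a little more concrete, here is a minimal Python sketch. It
assumes only the standard unicodedata module. The helper names
describe() and is_symbol() are made up for illustration, and taking the
first word of the character name as a "script hint" is an assumption
for the example, a rough heuristic and no substitute for the script
data in TR24.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, rough script hint) for one character."""
    name = unicodedata.name(ch, "")
    # Assumption: the first word of the character name approximates the script.
    # This is only a heuristic; the real script data lives in TR24.
    hint = name.split(" ")[0] if name else "UNKNOWN"
    return "U+%04X" % ord(ch), unicodedata.category(ch), hint

def is_symbol(ch):
    """True if the character's general category is one of the S* values."""
    return unicodedata.category(ch).startswith("S")

# Three capital letters that most fonts draw with the same glyph:
# Latin A, Greek Alpha, Cyrillic A.
lookalikes = ["A", "\u0391", "\u0410"]
for ch in lookalikes:
    print(describe(ch))
```

The three characters come back with the same general category but
different code points and different name prefixes, which is about all
the character database can say here: nothing in it records that they
look alike.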
>While much of this data may be deduced from the character names,
>you'll doubtless agree that many programmers would rather do something
>else than working out which characters exactly can appear in a Coptic
>context.

As I tried to hint at above, any attempt to give this answer is at best
possible in a fuzzy, probabilistic sense. Even a statement as simple as
"'e' is used for English" can be misleading. There's at least one novel
that does entirely without that letter, but is certainly in English.

A./