Robert Vanden Eynde writes:

> As I'm at, I mentionned the ffef character but we don't care about
> it because it already has a name, so that's mostly a control
> character issue.

The problem with control characters is that from the point of view of the Unicode Standard, the C0 and C1 ranges are basically a space reserved for private use (see ISO 6429 for the huge collection of standardized control functions). That is, unlike the rest of the Unicode repertoire, the "characters" mapped there are neither unique nor context-independent.

It's true that ISO 6429 recommends specific C0 and C1 sets (but the recommended C1 set isn't even complete: U+0080, U+0081, and U+0099 aren't assigned!). However, Unicode only suggests that those should be the default interpretations, because the useful control functions are going to be dependent on context (e.g., input and output devices).

This is like the situation with Internet addresses and domain names. The mapping is inherently many-to-many; round-tripping is not possible. And in fact there are a number of graphic characters that have multiple code points due to bugs in national character sets. So for graphic characters, it's possible to ensure name(code(x)) = x, but it's not possible to ensure code(name(x)) = x, except under special circumstances (which apply to the vast majority of characters, of course).

> I like your alias(...) function, with that one, an application
> could code my function like try name(x) expect
> alias(x).abbreviations[0]. If the abbreviation list is sorted by
> AdditionToUnicodeDate.

I don't understand why that's particularly useful, especially in the Han case (see below).

> However, having a standard canonical name for all character in the
> stdlib would help people choosing the same convention. A new
> function like "canonical_name" or a shorter name would be an idea.

I don't understand what you're asking for. The Unicode Standard already provides canonical names. Of course, the canonical names of most Han ideographs (near and dear to my heart) are pretty useless (they look like "CJK UNIFIED IDEOGRAPH-4E00").
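
To make the "try name(x), otherwise fall back" idea concrete, here is a minimal sketch using only the existing unicodedata module. The helper name_or_label() is hypothetical (it is not a stdlib function), and the fallback format is an assumption: it uses code point labels of the form <control-NNNN>, the style the Unicode Standard describes for code points that have no Name property, rather than the alias lookup proposed in the quoted text.

```python
import unicodedata


def _is_noncharacter(cp: int) -> bool:
    # Noncharacters: U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def name_or_label(ch: str) -> str:
    """Hypothetical helper: the Unicode name of ch, or a constructed label."""
    try:
        return unicodedata.name(ch)
    except ValueError:
        # unicodedata.name() raises ValueError when the code point has no name
        # (controls, private use, surrogates, noncharacters, unassigned).
        cp = ord(ch)
        if _is_noncharacter(cp):
            kind = "noncharacter"
        else:
            kind = {
                "Cc": "control",
                "Co": "private-use",
                "Cs": "surrogate",
            }.get(unicodedata.category(ch), "reserved")
        return f"<{kind}-{cp:04X}>"


print(name_or_label("\t"))      # <control-0009>
print(name_or_label("\ufeff"))  # ZERO WIDTH NO-BREAK SPACE
print(name_or_label("\u4e00"))  # CJK UNIFIED IDEOGRAPH-4E00

# For a graphic character the canonical name round-trips through lookup().
assert unicodedata.lookup(unicodedata.name("\u4e00")) == "\u4e00"
```

Note that the constructed labels are deliberately not valid arguments to unicodedata.lookup(), so they cannot be mistaken for real character names.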

(You probably don't want to get the Japanese, Chinese---and there are a lot of different kinds of Chinese---and Koreans started on what the "canonical" name should be. One Han Unification controversy is enough for this geological epoch!)

This is closely related to the Unicode Standard's generic recommendation (Ch. 4.8):

    On the other hand, an API which returns a name for Unicode code
    points, but which is expected to provide useful, unique labels
    for unassigned, reserved code points and other special code
    point types, should return the value of the Unicode Name
    property for any code point for which it is non-null, but should
    otherwise construct a code point label to stand in for a
    character name.

(I suppose "should" here is used in the sense of RFC 2119.) So, the standard defines a canonical naming scheme, although many character names are not terribly mnemonic even to native speakers.

On the other hand, if you want useful aliases for Han characters, for many of them there could be scores of aliases, based on pronunciation, semantics, and appearance, the first two of which vary substantially within a single language, let alone across languages. Worse, as far as I know there are no standard equivalent ways to express these things in English, as when writing about these characters in English you often adopt a romanized version of the technical terms in the language you're studying.

And, it's a minor point, but there are new Han characters discovered every day (I'm not even sure that's an exaggeration), as scholars examine regional and historical documents.

So for this to be most useful to me, I would want it developed OUTSIDE of the stdlib, with releases even more frequent than pytz (that is an exaggeration).
Not so much because I'll frequently need anything outside of the main CJK block in Plane 0, but because the complexity of character naming in East Asia suggests that improvements in heuristics for assigning priority to aliases, language-specific variations in heuristics, and so on will be rapid for the foreseeable future. It would be a shame to shackle that to the current stdlib release cycle even if it doesn't need to be as frenetic as pytz. This goes in spades for people who are waiting for their own scripts to be standardized.

For the stdlib, I'm -1 on anything other than the canonical names plus the primary aliases for characters which are well-defined in the code charts of the Unicode Standard, such as those for the C0 and (most of) the C1 control characters. And even there I think a canonical name based on block name + code point in hex is the best way to go.

I think this problem is a lot harder than many of the folk participating in this discussion so far realize.

Steve
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/