Steven D'Aprano writes:

> Sorry, I'm not sure if you mean my proposed alias() function isn't
> useful, or Robert's try...except loop around it.

I was questioning the utility of "If the abbreviation list is sorted
by AdditionToUnicodeDate."

But since you ask, neither function is useful TO ME, as I understand
them, because they're based on the UCD NameAliases.txt.  That file
doesn't have any aliases I would actually use.  I've never needed
aliases for control characters, and for everything else the canonical
name is perfectly useful (including for Korean characters and Japanese
kana, which have phonetic names, as do Chinese bopomofo AIUI).
There's nothing useful for Han characters yet, sadly.

> My alias() function is just an programmatic interface to information
> already available in the NameAliases.txt file. Don't you think that's
> useful enough as it stands?

To be perfectly frank, if that's all it is, I don't know when I'd ever
use it.  Your label function is *much* more useful.

To be specific about the defects of NameAliases.txt: "DEVICE CONTROL
1" tells me a lot less about that control character than "U+0011"
does.  Other aliases in that file are just wrong: I don't believe I've
ever seen U+001A used as "SUBSTITUTE" for an unrepresentable coded
character entity.  That's the DOS "END OF FILE".

Certainly, the aliases of category "correction" are useful, though not
to me, since I don't read any of the relevant languages.  The
"figment" category is stupid; almost all the names of control
characters are figments, except for the half-dozen well-known
whitespace characters, NUL, and maybe DEL.  The 256 VSxxx "variation
selectors" are somewhat useful, but I would think that it would be
even more useful to provide skin color aliases for face emoji and X11
RGB.txt color aliases for hearts and the like, which presumably are
standardized across vendors.

If I were designing a feature for the stdlib, I would

0.  Allow the database to consist of multiple alias tables, and be
    extensible by adding tables via user configuration.
1.  Make the priority of the alias tables user-configurable.
2.  Provide a default top-priority table more suited to likely Python
    usage than NameAliases.txt.
3.  Provide both a primary alias function and a "list of all aliases"
    function.
4.  Provide a reverse lookup function.
5.  Perhaps provide a context-sensitive alias function.  The only
    context I can think of offhand is "position in file", i.e., to
    distinguish between ZWNBSP and BOM, so perhaps that's not worth
    doing.  On the other hand, given that example, it's worth a few
    minutes' thought to see if there are other context-sensitive
    naming practices that more than a few people would want to follow.

> Indeed. [Having multiple non-UCD aliases is] also the case for
> emoji. That's why I suggested making alias() return a mutable
> record rather than an immutable tuple, so application writers can
> add their own records to suit their own needs.

Why should they add them to the tuple returned by the function, rather
than to the database the function consults?

> fully-fledged PEP -- but I think the critical point here is that we
> shouldn't be privileging one alias type over the others.

I don't understand.  By providing stdlib support for NameAliases.txt
only, you are privileging those aliases.  If you mean privileging the
Name property over the aliases, well, that's what "canonical" means,
and yes, I think the Name property should be privileged (e.g., ZERO
WIDTH NO-BREAK SPACE over BYTE ORDER MARK).

> That seems fairly extreme. New Unicode versions don't come out that
> frequently. Surely we don't expect to track draft aliases, or
> characters outside of Unicode?

Why not track draft aliases in a "draft alias" table?  More important,
why not track aliases of *Unicode* characters that could use aliases
(e.g., translations) in separate tables?  For example, there are
"shape based names" for Han characters, which are standard enough that
users would be able to construct them (Unicode 11 includes one such
system, see section 18.2).
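
For concreteness, here is a minimal sketch of points 0, 1, 3 and 4 of
the list above, with each kind of alias kept in its own table.  All of
the names in it (AliasRegistry, add_table, set_priority, aliases,
alias, lookup) are hypothetical, the sample data is only illustrative,
and nothing here is an existing stdlib API.

```python
class AliasRegistry:
    """Hypothetical registry of alias tables, consulted in priority order."""

    def __init__(self):
        self._tables = {}    # table name -> {code point: [aliases]}
        self._priority = []  # table names, highest priority first

    def add_table(self, name, table, priority=None):
        """Point 0: register a table (lowest priority unless an index is given)."""
        self._tables[name] = {cp: list(names) for cp, names in table.items()}
        if name in self._priority:
            self._priority.remove(name)
        if priority is None:
            self._priority.append(name)
        else:
            self._priority.insert(priority, name)

    def set_priority(self, names):
        """Point 1: let the user reorder the registered tables."""
        if sorted(names) != sorted(self._tables):
            raise ValueError("priority list must name every table exactly once")
        self._priority = list(names)

    def aliases(self, codepoint):
        """Point 3: all aliases, highest-priority table first."""
        result = []
        for name in self._priority:
            result.extend(self._tables[name].get(codepoint, []))
        return result

    def alias(self, codepoint):
        """Point 3: the primary alias, or None if there is none."""
        found = self.aliases(codepoint)
        return found[0] if found else None

    def lookup(self, alias):
        """Point 4: reverse lookup from an alias to a code point, or None."""
        for name in self._priority:
            for codepoint, names in self._tables[name].items():
                if alias in names:
                    return codepoint
        return None


# Illustrative data only: a few NameAliases.txt-style entries plus a
# user-supplied table that takes priority over them.
registry = AliasRegistry()
registry.add_table("ucd", {
    0x001A: ["SUBSTITUTE", "SUB"],
    0xFEFF: ["BYTE ORDER MARK", "BOM", "ZWNBSP"],
})
registry.add_table("local", {0x001A: ["END OF FILE"]}, priority=0)

print(registry.alias(0x001A))   # END OF FILE
print(registry.lookup("BOM") == 0xFEFF)  # True
```

A "draft alias" table or a table of translated names would just be one
more add_table() call, and the data itself could live outside the
stdlib.
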
Japanese names for Han radicals are another case: they often vary from
the UCD Name property, and are often more precise (many describe the
geometric relation of the radical to the rest of the character).

It is not obvious to me that an alias() that only looks at
NameAliases.txt is useful enough to belong in the stdlib.  Conversely,
a module that bundles rapidly accumulating databases along the lines
I've mentioned above definitely doesn't belong in the stdlib (compare
pytz).  The *access functions*, however, might belong in the stdlib,
in the same way that timezone-sensitive datetime APIs do, but that
more or less requires knowing what databases and "schemas" are out
there, and trying to set things up so that the same APIs can access a
number of databases.

> To clarify, do you mean the aliases defined in NameAliases.txt? Or a
> subset of them?

I misunderstood your alias function, which I think is overengineered
for the purpose of handling aliases.  I was thinking in terms of
returning a string or, at its most general, a list of strings.

If you are going to define a class to represent metadata about a
character, why not make *all* metadata available?  Probably most of
the attributes would be properties, lazily accessing various
databases:

    import unicodedata

    class Codepoint:
        def __init__(self, codepoint):
            self.codepoint = codepoint  # integer code point
            self._cache = {}            # property name -> looked-up value

        def _cached(self, key, lookup):
            # Run the database lookup once and remember the result.
            if key not in self._cache:
                self._cache[key] = lookup()
            return self._cache[key]

        @property
        def name(self):
            # Access name database and cache result (None if unnamed).
            return self._cached(
                "name", lambda: unicodedata.name(chr(self.codepoint), None))

        @property
        def category(self):
            # Access category database and cache result.
            return self._cached(
                "category", lambda: unicodedata.category(chr(self.codepoint)))

        @property
        def alias(self):
            # Populates alias_list, and returns the first one (or None).
            aliases = self.alias_list
            return aliases[0] if aliases else None

        @property
        def alias_list(self):
            # Access alias database (not limited to NameAliases.txt) and
            # cache result.  Placeholder: no such database exists yet, so
            # this returns an empty list.
            return self._cached("alias_list", list)

        @property
        def label(self):
            # Populates and returns name, if available, otherwise a code
            # point label (here simply U+<code>).
            return self.name or "U+%04X" % self.codepoint

and so on.  But that's a new thread.

> > And even there I think a canonical name based on block name +
> > code point in hex is the best way to go.
>
> I believe you might be thinking of the Unicode "code point label"
> concept.

Yes, as MRAB has suggested.  I would be a little more precise than he,
in that I would label the C0 and C1 control blocks with CONTROL-<code>
rather than just U+<code>.

Steve
_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/