Replying to a few points out of order...

On Thu, Jul 12, 2018 at 02:03:07AM +0000, Robert Vanden Eynde wrote:

> lookup(name(x)) == x for all x is natural isn't it ?

The Unicode Consortium doesn't think so, or else they would mandate that
all defined code points have a name.

> In the NameAliases
> https://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt
> one can see that some characters have multiple aliases, so there are
> multiple ways to map a character to a name.

That's a pretty old version -- we're up to version 11 now.

https://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt

> I propose adding a keyword argument, to
> unicodedata.name<http://unicodedata.name>

I don't think that's a real URL.

> that would implement one of some useful behavior when the value does
> not exist.

I am cautious about overloading functions with keyword-only arguments to
implement special behaviour. Guido has a strong preference for the "no
constant flags" rule of thumb (except I think we can extend it beyond
just True/False to any N-state value), and I agree with that.

The rule of thumb says that if you have a function that takes an
optional flag which chooses between two (or more) distinct behaviours,
AND the function is usually called with that flag given as a constant,
then we should usually prefer to split the function into two separately
named functions.

For example, in the statistics module, I have stdev() and pstdev(),
rather than stdev(population=False) and stdev(population=True).

(It's a rule of thumb, not a hard law of nature. There are exceptions.)

It sounds to me that your proposal would fit those conditions and so we
should prefer a separate function, or a separate API, for doing more
complex name look-ups.

*Especially* if there's a chance that we'll want to extend this some day
to use more flags...

    name(char,
         abbreviation=False,
         correction=True,
         control=True,
         figment=True,
         alternate=False,
         )

which are all alias types defined by NameAliases.txt.

> One simple behavior would be to chose the name in the "abbreviation"
> list.
> Currently all characters except three only have one and only one
> abbreviation so that would be a good pick, so I'd imagine name('\x00',
> abbreviation=True) == 'NUL'

To my mind, that calls out for a separate API to return character alias
properties as a separate data type:

    alias('\u0001')
    => UnicodeAlias(control='START OF HEADING', abbreviation='SOH')

    alias('\u000B')
    => UnicodeAlias(control=('LINE TABULATION', 'VERTICAL TABULATION'),
                    abbreviation='VT')
    # alternatively, fields could be a single semi-colon delimited
    # string rather than a tuple in the event of multiple aliases

    alias('\u01A2')
    => UnicodeAlias(correction='LATIN CAPITAL LETTER GHA')

    alias('\u0099')
    => UnicodeAlias(figment='SINGLE GRAPHIC CHARACTER INTRODUCER',
                    abbreviation='SGC')

Fields not shown return the empty string.

This avoids overloading the name() function, future-proofs against new
alias types, and if UnicodeAlias is a mutable object, easily permits the
caller to customise the records to suit their own application's needs:

    def myalias(char):
        # unicodedata.alias() is the function proposed above; it does
        # not exist in the unicodedata module today.
        alias = unicodedata.alias(char)
        if char == '\U0001f346':
            alias.other = ('eggplant', 'purple vegetable')
            alias.slang = ('phallic', ... )
        return alias


-- 
Steve
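
For illustration only, here is a minimal sketch of how the alias() idea
described above might behave. Nothing in it is part of the unicodedata
module: alias(), load_aliases(), SAMPLE and ALIAS_TYPES are hypothetical
names, types.SimpleNamespace stands in for the mutable UnicodeAlias
record, and the data is just the handful of NameAliases.txt records
mentioned in this message (format: code point;alias;type) rather than
the full file.

```python
from collections import defaultdict
from types import SimpleNamespace

# A few records in NameAliases.txt format: code point;alias;type.
# Only the characters discussed in this thread are included.
SAMPLE = """\
0001;START OF HEADING;control
0001;SOH;abbreviation
000B;LINE TABULATION;control
000B;VERTICAL TABULATION;control
000B;VT;abbreviation
0099;SINGLE GRAPHIC CHARACTER INTRODUCER;figment
0099;SGC;abbreviation
01A2;LATIN CAPITAL LETTER GHA;correction
"""

# The five alias types named in this thread.
ALIAS_TYPES = ("correction", "control", "alternate", "figment", "abbreviation")


def load_aliases(text):
    """Parse NameAliases.txt-style text into {char: {type: [aliases]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code, name, kind = line.split(";")
        table[chr(int(code, 16))][kind].append(name)
    return table


_TABLE = load_aliases(SAMPLE)


def alias(char):
    """Hypothetical alias(): return a mutable record of char's aliases.

    Missing fields are the empty string; a field with several aliases
    is a tuple, otherwise a plain string.
    """
    found = _TABLE.get(char, {})
    fields = {}
    for kind in ALIAS_TYPES:
        names = found.get(kind, [])
        if not names:
            fields[kind] = ""
        elif len(names) == 1:
            fields[kind] = names[0]
        else:
            fields[kind] = tuple(names)
    return SimpleNamespace(**fields)


record = alias("\u000B")
print(record.control, record.abbreviation)
```

Because the record is an ordinary mutable namespace, a caller can attach
extra fields to it (as in the myalias() example) without any change to
the lookup function, and a new alias type only needs a new entry in
ALIAS_TYPES.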