Tom Christiansen <tchr...@perl.com> added the comment:

Ezio Melotti <rep...@bugs.python.org> wrote on Sun, 02 Oct 2011 06:46:26 -0000:

> Actually Python doesn't seem to support \N{LINE FEED (LF)}, most likely
> because that's a Unicode 1 name, and nowadays these codepoints are simply
> marked as '<control>'.

Yes, but there are a lot of them, 65 of them in fact. I do not care to see
people being forced to use literal control characters or inscrutable magic
numbers. It really bothers me that you have all these defined code points,
with properties and all, that have no name. People do use these. Some of
them a lot.

I don't mind \n and such -- and in fact, prefer them even -- but I feel I
should not have to scratch my head over character \033, \0177, and
brethren. The C0 and C1 standards are not just inventions, so we use them.
Far better that one should write \N{ESCAPE} for \033 or \N{DELETE} for
\0177, don't you think?

>> If so, then I don't understand that. Nobody in their right
>> mind prefers "\N{LINE FEED (LF)}" over "\N{LINE FEED}" -- do they?

> They probably don't, but they just write \n anyway. I don't think we need
> to support any of these aliases, especially if they are not defined in the
> Unicode standard.

If you look at Names.txt, there are significant "aliases" there for the
C0/C1 stuff. My bottom line is that I don't like to be forced to use magic
numbers. I prefer to name my abstractions. It is more readable and more
maintainable that way.

There are still "holes" of course. Code point 128 has no name even in C1.
But something is better than nothing. Plus at least in Perl we *can* give
things names if we want, per the APPLE LOGO example for U+F8FF. So nothing
needs to remain nameless. Why, you can even name your Kanji if you want,
using whatever Romanization you prefer.

I think the private-use case example is really motivating, but I have no
idea how to do this for Python because there is no lexical scope. I suppose
you could attach it to the module, but that still doesn't really work
because of how things get evaluated.

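For illustration only, a run-time table of aliases can stand in for the missing names. It does nothing for \N{...} inside string literals, which are resolved when the code is compiled. This is a minimal sketch: ALIASES and char_by_name are hypothetical names, and the table holds just the few names mentioned above.

```python
import unicodedata

# Hypothetical alias table: three C0 control names discussed above, plus a
# private-use name in the style of the APPLE LOGO example for U+F8FF.
ALIASES = {
    "LINE FEED": "\u000a",
    "ESCAPE": "\u001b",
    "DELETE": "\u007f",
    "APPLE LOGO": "\uf8ff",
}


def char_by_name(name):
    """Return the character for an alias, else for a formal Unicode name."""
    try:
        return ALIASES[name.upper()]
    except KeyError:
        # Fall back to the standard library's run-time name lookup.
        return unicodedata.lookup(name)


if __name__ == "__main__":
    for name in ("ESCAPE", "DELETE", "APPLE LOGO", "LATIN SMALL LETTER A"):
        print(f"{name}: U+{ord(char_by_name(name)):04X}")
```

Checking the local table first means a private name can also shadow a formal one, which may or may not be what you want.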
With a Perl compile-time use, we can change the compiler's ideas about
things, like adding function prototypes and even extending the base types:

    % perl -Mbigrat -le 'print 1/2 + 2/3 * 4/5'
    31/30

    % perl -Mbignum -le 'print 21->is_odd'
    1
    % perl -Mbignum -le 'print 18->is_odd'
    0

    % perl -Mbignum -le 'print substr(2**5000, -3)'
    376
    % perl -Mbignum -le 'print substr(2**5000-1, -3)'
    375

    % perl -Mbignum -le 'print length(2**5000)'
    1506
    % perl -Mbignum -le 'print length(10**5000)'
    5001

    % perl -Mbignum -le 'print ref 10**5000'
    Math::BigInt
    % perl -Mbigrat -le 'print ref 1/3'
    Math::BigRat

I recognize that redefining what sort of object the compiler treats some of
its constants as is never going to happen in Python, but we actually did
manage that with charnames without having to subclass our strings: the hook
for \N{...} doesn't require object games like the ones above.

But it still has to happen at compile time, of course, so I don't know what
you could do in Python. Is there any way to change how the compiler behaves
even vaguely along these lines?

The run-time lookups of Python's unicodedata.lookup (like Perl's
charnames::vianame) and unicodedata.name (like Perl's charnames::viacode on
the ord) could be managed with a hook, but I don't see any way around the
compile-time lookups of \N{...}. But I don't know anything about Python's
internals, so I don't even know what is or is not possible.

I do note that if you could extend \N{...} the way we do with charname
aliases for private-use characters, the user could load something that did
the C0 and C1 controls if they wanted to. I just don't know how to do that
early enough that the Python compiler would see it. Does your import happen
at run-time or at compile-time? This would be some sort of compile-time
binding of constants.

>> Python doesn't require it. :)/2

> I actually find those *less* readable.
> If there's something fancy in the regex, a comment *before* it is
> welcomed, but having to read a regex divided on several lines and remove
> meaningless whitespace and redundant comments just makes the parsing more
> difficult for me.

Really? White space makes things harder to read? I thought Pythonistas
believed the opposite of that. Whitespace is very useful for cognitive
chunking: you see how things logically group together.

Inomorewantaregexwithoutwhitespacethananyothercodeortext. :)

I do grant you that chatty comments may be a separate matter.

White space in patterns is also good when you have successive patterns
across multiple lines that have parts that are the same and parts that are
different, as in most of these, which are from a function to render an
English headline/book/movie/etc title into its proper casing:

    # put into lowercase if on our stop list, else titlecase
    s/  ( \pL [\pL']* )  /$stoplist{$1} ? lc($1) : ucfirst(lc($1))/xge;

    # capitalize a title's last word and its first word
    s/^ ( \pL [\pL']* )  /\u\L$1/x;
    s/  ( \pL [\pL']* ) $/\u\L$1/x;

    # treat parenthesized portion as a complete title
    s/ \( ( \pL [\pL']* )    /(\u\L$1/x;
    s/    ( \pL [\pL']* ) \) /\u\L$1)/x;

    # capitalize first word following colon or semi-colon
    s/ ( [:;] \s+ ) ( \pL [\pL']* ) /$1\u\L$2/x;

Now, that isn't good code for all *kinds* of reasons, but white space is not
one of them. Perhaps what it is best at demonstrating is why Python goes
about this the right way and Perl does not.

Oh drat, I'm about to attach this to the wrong bug. But it was the dumb
code above that made me think about the following.

By virtue of having a "titlecase each word's first letter and lowercase the
rest" function in Python, you can put the logic in just one place, and
therefore if a bug is found, you can fix all the code at once.

But because Perl has always made it easy to grab "words" (actually,
traditional programming language identifiers) and diddle their case, people
write this all the time:

    s/(\w+)/\u\L$1/g;

and that has all kinds of problems. If you prefer the functional approach,
that is really

    s/(\w+)/ucfirst(lc($1))/ge;

but that is still wrong.

 1. Too much code duplication. Yes, it's nice to see \pL[\pL']* stand out
    on each line, but shouldn't that be in a variable, like

        $word = qr/\pL[\pL']*/;

 2. What is a "word"? That code above is better than \w because it avoids
    numbers and underscores; however, it still uses letters only, not
    letters and marks, let alone number letters like Roman numerals.

 3. I see the apostrophe there, which is a good start, but what if it is a
    RIGHT SINGLE QUOTATION MARK, as in "Henry’s"? And what about hyphens?
    Those should not trigger capitalization in normal titles.

 4. It turns out that all code that does a titlecase on the first character
    of a string it has already converted to lowercase has irreversibly lost
    information. Unicode casing is not reversible. Using \w for
    convenience, these can do different things:

        s/(\w+)/\u\L$1/g;
        s/(\w)(\w*)/\u$1\L$2/g;

    or in the functional approach,

        s/(\w+)/ucfirst(lc($1))/ge;
        s/(\w)(\w*)/ucfirst($1) . lc($2)/ge;

    Now, it is true that only these code points do the wrong thing using
    the naïve approach under Unicode 6.0:

        % unichars -gas 'ucfirst ne ucfirst lc'
         İ  U+00130 GC=Lu SC=Latin    LATIN CAPITAL LETTER I WITH DOT ABOVE
         ϴ  U+003F4 GC=Lu SC=Greek    GREEK CAPITAL THETA SYMBOL
         ẞ  U+01E9E GC=Lu SC=Latin    LATIN CAPITAL LETTER SHARP S
         Ω  U+02126 GC=Lu SC=Greek    OHM SIGN
         K  U+0212A GC=Lu SC=Latin    KELVIN SIGN
         Å  U+0212B GC=Lu SC=Latin    ANGSTROM SIGN

    But it is still the wrong thing, and we never know what might happen in
    the future.

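The same comparison can be tried from Python. This is a sketch under one loud assumption: Python has no ucfirst, so str.title() applied to a one-character string stands in for it here. The function names are hypothetical, and SUSPECTS is simply the six code points from the unichars listing above.

```python
import unicodedata

# The six code points reported by the unichars run above (Unicode 6.0).
SUSPECTS = "\u0130\u03f4\u1e9e\u2126\u212a\u212b"


def titlecase_direct(ch):
    """Titlecase one character as it stands (stand-in for ucfirst)."""
    return ch.title()


def titlecase_via_lower(ch):
    """Lowercase first, then titlecase: the order that loses information."""
    return ch.lower().title()


def loses_information(ch):
    """True when lowercasing first changes the titlecased result."""
    return titlecase_direct(ch) != titlecase_via_lower(ch)


if __name__ == "__main__":
    for ch in SUSPECTS:
        name = unicodedata.name(ch)
        print(f"U+{ord(ch):04X} {name}: lossy={loses_information(ch)}")
```

KELVIN SIGN shows the mechanism plainly: lowercasing yields an ordinary ASCII k, and nothing can map that back to U+212A afterwards.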
I think Python is being smarter than Perl in simply providing people with a
titlecase-each-word('s-first-letter-and-lowercase-the-rest)-in-the-whole-string
function, because this means people won't be tempted to write

    s/(\w+)/ucfirst(lc($1))/ge;

all the time. However, as I have written elsewhere, I question a lot of its
underlying assumptions. It's clear that a "word" must in general include
not just Letters but also Marks, or else you get different results in NFD
and NFC, and the Unicode Standard is very against that.

However, the problem is that what a word is cannot be considered independent
of language. Words in English can contain apostrophes (whether written as
an APOSTROPHE or as RIGHT SINGLE QUOTATION MARK) and hyphens (written as
HYPHEN-MINUS, HYPHEN, and rarely even EN DASH).

Each of these is a single word:

    ’tisn’t
    anti‐intellectual
    earth–moon

The capitalization there should be

    ’Tisn’t
    Anti‐intellectual
    Earth–Moon

Notice how you can't do the same with the first apostrophe+t as with the
second in "’Tisn’t". That is all challenging to code correctly (did you
notice the EN DASH?), especially when you find something like
red‐violet–colored. You probably want that to be Red‐violet–colored,
because it is not an equal compound like earth–moon or yin–yang, which in
correct orthography take an EN DASH not a HYPHEN, just as occurs when you
hyphenate an already hyphenated word like red‐violet against colored, as in
a red‐violet–colored flower. English titling rules capitalize only the
first word in hyphenated words, which is why it's Anti‐intellectual not
Anti-Intellectual.

And of course, you can't actually create something in true English titlecase
without having a stop list of articles and (short) prepositions, and paying
attention to whether it is the first or last word in the title, and whether
it follows a colon or semicolon.

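To make those rules concrete, here is a rough sketch of stop-list titling in Python. Everything named in it is hypothetical: STOP_WORDS is a deliberately tiny list, WORD is one guess at what counts as a word (letters joined by an apostrophe or hyphen, with EN DASH left as a separator), and english_title is not any library's function. It does not include marks in words, so NFD input will misbehave, and it knows nothing about phrasal verbs or unequal compounds like red‐violet–colored.

```python
import re

# Hypothetical stop list: articles, a few conjunctions, short prepositions.
STOP_WORDS = frozenset(
    "a an the and but or nor as at by for in of on to up via".split()
)

# Letters, optionally joined by APOSTROPHE, RIGHT SINGLE QUOTATION MARK,
# HYPHEN, or HYPHEN-MINUS. EN DASH is not a joiner, so "earth–moon" is
# treated as two words.
WORD = re.compile(r"[^\W\d_]+(?:['\u2019\u2010-][^\W\d_]+)*")


def english_title(title):
    """Apply stop-list titling; first, last, and post-colon words get caps."""
    matches = list(WORD.finditer(title))
    last = len(matches) - 1
    out = []
    pos = 0
    for i, match in enumerate(matches):
        gap = title[pos:match.start()]  # punctuation and spaces before word
        out.append(gap)
        word = match.group()
        after_break = re.search(r"[:;]\s*$", gap) is not None
        if i in (0, last) or after_break or word.lower() not in STOP_WORDS:
            # Titlecase the first character only; never lowercase it first.
            out.append(word[0].title() + word[1:].lower())
        else:
            out.append(word.lower())
        pos = match.end()
    out.append(title[pos:])
    return "".join(out)


if __name__ == "__main__":
    print(english_title("the lord of the rings: the fellowship of the ring"))
    print(english_title("anti-intellectual earth\u2013moon"))
```

Keeping the hyphen inside the word pattern is what leaves the second half of a hyphenated word in lowercase, while the EN DASH split capitalizes both halves of an equal compound.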
Consider that phrasal verbs are construed to take adverbs not prepositions,
and so "Bringing In the Sheaves" would be the correct capitalization of that
song, since "to bring in" is a phrasal verb, but "A Ringing in My Ears"
would be right for that. It is remarkably complicated.

With English titlecasing, you have to respect what your publishing house
considers a "short" preposition. A common cut-off is that short preps have
4 or fewer characters, but I have seen longer cutoffs. Here is one rather
exhaustive list of English prepositions sorted by length:

     2: as at by in of on to up vs

     3: but for off out per pro qua via

     4: amid atop down from into like near next onto over pace past plus
        sans save than till upon with

        <cutoff point for O'Reilly Media>

     5: about above after among below circa given minus round since thru
        times under until worth

     6: across amidst around before behind beside beyond during except
        inside toward unlike versus within

     7: against barring beneath besides between betwixt despite failing
        outside through thruout towards without

    10: throughout underneath

The thing is that prepositions become adverbs in phrasal verbs, like "to go
out" or "to come in", and all adverbs are capitalized. So a complete
solution requires actual parsing of English!!!! Just say no -- or stronger.

Merely getting something like this right:

    the lord of the rings: the fellowship of the ring  # Unicode lowercase
    THE LORD OF THE RINGS: THE FELLOWSHIP OF THE RING  # Unicode uppercase
    The Lord of the Rings: The Fellowship of the Ring  # English titlecase

is going to take a bit of work. So is

    the sad tale of king henry ⅷ and caterina de aragón  # Unicode lowercase
    THE SAD TALE OF KING HENRY Ⅷ AND CATERINA DE ARAGÓN  # Unicode uppercase
    The Sad Tale of King Henry Ⅷ and Caterina de Aragón  # English titlecase

(and that must give the same answer in NFC vs NFD, of course.)

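One way to get the same answer in NFC and NFD is to count combining marks as word characters, so that a decomposed accent does not end the word. The following is only a sketch: is_word_char and simple_titlecase are hypothetical names, and the category test (letters, marks, decimal digits, letter numbers, connector punctuation) is an approximation. It makes no attempt at English stop lists, apostrophes, or hyphens.

```python
import unicodedata


def is_word_char(ch):
    """Approximate word test: letters, marks, digits, connectors."""
    cat = unicodedata.category(ch)
    return cat[0] in "LM" or cat in ("Nd", "Nl", "Pc")


def simple_titlecase(text):
    """Titlecase each word's first character and lowercase the rest.

    Characters are cased one at a time, so context-sensitive mappings
    (for example, the Greek final sigma) are not applied.
    """
    out = []
    in_word = False
    for ch in text:
        if is_word_char(ch):
            # Title the first character directly; never lowercase it first.
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


if __name__ == "__main__":
    nfc = "caterina de arag\u00f3n"
    nfd = unicodedata.normalize("NFD", nfc)
    print(simple_titlecase(nfc))
    print(unicodedata.normalize("NFC", simple_titlecase(nfd)))
```

Because the combining acute accent counts as part of the word, the letter after it stays in lowercase in the decomposed form, just as it does in the precomposed one.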
Plus what to do with something like num2ascii is ill-defined in English,
because having digits in the middle of a word is a very new phenomenon.
Yes, Y2K gets caps, but that is for another reason. There is no agreement
on what one should do with num2ascii or people42see. A function name
shouldn't be capitalized at all of course.

And that is just English. Other languages have completely different rules.
For example, per Wikipedia's entry on the colon:

    In Finnish and Swedish, the colon can appear inside words in a
    manner similar to the English apostrophe, between a word (or
    abbreviation, especially an acronym) and its grammatical (mostly
    genitive) suffixes. In Swedish, it also occurs in names, for example
    Antonia Ax:son Johnson (Ax:son for Axelson). In Finnish it is used
    in loanwords and abbreviations; e.g., USA:han for the illative case
    of "USA". For loanwords ending orthographically in a consonant but
    phonetically in a vowel, the apostrophe is used instead: e.g. show'n
    for the genitive case of the English loan "show" or Versailles'n for
    the French place name Versailles.

Isn't that tricky! I guess that you would have to treat punctuation that
has a word character immediately following it (and immediately preceding it)
as being part of the word, and that it doesn't signal that a change in case
is merited.

I'm really not sure. It is not obvious what the right thing to do here is.

I do believe that Python's titlecase function can and should be fixed to
work correctly with Unicode. There really is no excuse for turning Aragón
into AragóN, for example, or not doing the right thing with ⅷ and Ⅷ.

I fear the only thing you can do with the confusion of Unicode titlecase and
English titlecase is to explain that properly rendering English titles and
headlines is a much more complicated job which you will not even attempt.
(And shouldn't. English titlecase is clearly too specialized for a general
function.)

However, I'm still bothered by things with apostrophes.

    can't
    isn't
    wouldn't've
    Bill's
    'tisn't

since I can't countenance the obviously wrong:

    Can'T
    Isn'T
    Wouldn'T'Ve
    Bill'S
    'Tisn'T

with the last the hardest to get right. I do have code that correctly
handles English words and code that correctly handles English titles, but it
is much trickier than the titlecase() function.

And Swedes might be upset seeing Antonia Ax:Son Johnson instead of Antonia
Ax:son Johnson.

Maybe we should just go back to the Pythonic equivalent of

    s/(\w)(\w*)/ucfirst($1) . lc($2)/ge;

where \w is specifically per tr18's Annex C, and give up on punctuation
altogether, with a footnoted caveat or something. I wouldn't complain about
that. The rest is just too, too hard. Wouldn't you agree?

Thank you very much for all your hard work -- and patience with me.

--tom

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12753>
_______________________________________