Tom Christiansen <tchr...@perl.com> added the comment:

David Murray <rep...@bugs.python.org> wrote:
> Tom, note that nobody is arguing that what you are requesting is a bad
> thing :)

There looked to be some minor resistance, based on absolute backwards
compatibility even if wrong, regarding changing anything *at all* in re,
even things that to my jaded eye seem like actual bugs.

There are bugs, and then there are bugs. In my survey of Unicode support
across 7 programming languages for OSCON
http://training.perl.com/OSCON/index.html I came across a lot of
weirdnesses, especially at first when the learning curve was high. Sure,
I found it odd that unlike Java, Perl, and Ruby, Python didn't offer
regular casemapping on strings, only the simple character-based mapping.
But that doesn't make it a bug, which is why I filed it as a
feature/enhancement request/wish, not as a bug.

I always count as bugs not handling Unicode text the way Unicode says it
must be handled. Such things would be:

 * Emitting CESU-8 when told to emit UTF-8.

 * Violating the rule that UTF-8 must be in the shortest possible
   encoding.

 * Not treating a code point as a letter when the supported version of
   the UCD says it is. (This can happen if internal rules get out of
   sync with the current UCD.)

 * Claiming one does "the expected thing on Unicode" for
   case-insensitive matches when not doing what Unicode says you must
   minimally do: use at least the simple casefolds, if not in fact the
   full ones.

 * Saying \w matches Unicode word characters when one's definition of
   word characters differs from that of the supported version of the
   UCD.

Supporting Unicode vX.Y.Z is more than adding more characters. All the
behaviors specified in the UCD have to be updated too, or else you are
just ISO 10646. I believe some of Python's Unicode bugs happened because
folks weren't aware which things in Python were defined by the UCD or by
various UTS reports yet were not being directly tracked that way. That's
why it's important to always fully state which version of these things
you follow.
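To make the simple-versus-full casefold distinction concrete, here is a
minimal Python 3 sketch. It assumes str.casefold (added in Python 3.3),
which applies the full Unicode case folding; str.lower applies only the
per-character mapping, which is the minimal behavior described above:

    # Full case folding expands some characters; simple mapping does not.
    s = "Straße"
    print(s.lower())     # straße  -- per-character mapping leaves ß alone
    print(s.casefold())  # strasse -- full fold expands ß to "ss"

    # A case-insensitive comparison done per the UCD compares casefolds:
    print("STRASSE".casefold() == "Straße".casefold())  # True

A regex engine that does case-insensitive matching with only the simple
mapping will say "STRASSE" and "Straße" differ, which is exactly the
kind of divergence from the Standard at issue here.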
Other bugs, many actually, are a result of the narrow/wide-build
untransparency.

There is wiggle room in some of these. For example, in the one that
applies to re, you could -- in a sense -- remove the bug by no longer
claiming to do case-insensitive matches on Unicode. I do not find that
very useful. Javascript works this way: it doesn't do Unicode
casefolding. In Java you have to ask nicely with the extra UNICODE_CASE
flag, aka "(?u)", used with CASE_INSENSITIVE, aka "(?i)".

Sometimes languages provide different but equivalent interfaces to the
same functionality. For example, you may not support the Unicode
property \p{NAME=foobar} in patterns but instead support \N{foobar} in
patterns and hopefully also in strings. That's just fine.

On slightly shakier ground but still I think defensible is how one
approaches support for the standard UCD properties:

    Case_Folding          Simple_Case_Folding
    Titlecase_Mapping     Simple_Titlecase_Mapping
    Uppercase_Mapping     Simple_Uppercase_Mapping
    Lowercase_Mapping     Simple_Lowercase_Mapping

One can support folding, for example, via (?i) and not have to directly
support a Case_Folding property like \p{Case_Folding=s}, since "(?i)s"
should be the same thing as "\p{Case_Folding=s}".

> As far as I know, Matthew is the only one currently working on the
> regex support in Python. (Other developers will commit small fixes if
> someone proposes a patch, but no one that I've seen other than Matthew
> is working on the deeper issues.) If you want to help out that would
> be great.

Yes, I actually would. At least as I find time for it. I'm a competent C
programmer and Matthew's C code is very well documented, but reading it
carefully is very time consuming. For bang-for-buck, I do best on test
and doc work, making sure things are actually working the way they say
they do. I was pretty surprised and disappointed by how much trouble I
had with Unicode work in Python.
A bit of that is learning curve, a bit of it is suboptimal defaults, but
quite a bit of it is that things either don't work the way Unicode says
they must, or that something is altogether missing. I'd like to help at
least make the Python documentation clearer about what it is or is not
doing in this regard.

But be warned: one reason that Java 1.7 handles Unicode more according
to the published Unicode Standard in its Character, String, and Pattern
classes is that when they said they'd be supporting Unicode 6.0.0, I
went through those classes, and every time I found something in
violation of that Standard, I filed a bug report that included a
documentation patch explaining what they weren't doing right. Rather
than apply my rather embarrassing doc patches, they instead fixed the
code. :)

> And as far as this particular issue goes, yes the difference between
> the narrow and wide build has been a known issue for a long time, but
> has become less and less ignorable as unicode adoption has grown.

Little would make me happier than for Python to move to a logical
character model the way Go and Perl treat them. I find getting bogged
down by code units to be abhorrently low level, and it doesn't matter
whether these are 8-bit code units as in PHP or 16-bit code units as in
Java. The Nth character of a string should always be its Nth logical
code point, not its Nth physical code unit. In regular expressions, this
is a clearly stated requirement in the Unicode Standard (see tr18 RL1.1
and RL1.7).

However, it is more than that. The entire character processing model
really should be logical, not physical. That's because you need to have
whole code points, not broken-up code units, before you can build the
still higher-level components needed to meet user expectations.
These include user-visible graphemes (like an E with both a combining
acute and a combining up tack below) and linguistic collating elements
(like the letter <ch> in Spanish, <dzs> in Hungarian, or <dd> in Welsh).

> Martin's PEP that Matthew references is the only proposed fix that I
> know of. There is a GSoC project working on it, but I have no idea
> what the status is. Other possible fixes are using UTF-8 or UTF-32.

One reason I don't like that PEP is that if you are really that
concerned with storage space, it is too prone to spoilers. Neither UTF-8
nor UTF-32 has any spoilers, but that PEP does. What I mean is that just
one lone code point in a huge string is enough to change the
representation of everything in that string. I think of these as
spoilers; strings that are mostly non-spoilers with just a bit of
spoiler in them are super, super common. Often it's all ASCII plus just
a few ISO-8859-1 or other non-ASCII Latin characters. Or it's all Latin
with a couple of non-BMP mathematical alphanumerics thrown in. That kind
of thing.

Consider this mail message. It contains exactly six non-ASCII code
points.

    % uniwc `mhpath cur +drafts`
    Paras    Lines    Words   Graphs    Chars    Bytes File
       79      345     2796    16899    16899    16920 /home/tchrist/Mail/drafts/1183

Because it is in UTF-8, its memory profile in bytes grows only very
slightly over its character count. However, if you adopt the PEP, then
you pay and pay and pay, very nearly quadrupling the memory profile for
six particular characters. Now it takes 67596 bytes instead of 16920,
just for the sake of six code points. Ouch!!

Why would you want to do that? You say you are worried about memory, but
then you would do this sort of thing. I just don't understand.
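The spoiler effect is easy to measure on any interpreter that implements
the PEP's flexible representation (as CPython eventually did with PEP
393). A sketch, with the caveat that exact object sizes vary by Python
version and platform:

    import sys

    ascii_text = "a" * 10_000               # pure ASCII: 1 byte per code point
    spoiled = ascii_text + "\U0001D518"     # one astral "spoiler" appended

    print(sys.getsizeof(ascii_text))  # roughly 10 KB of payload
    print(sys.getsizeof(spoiled))     # roughly 40 KB: 4 bytes per character

    # One lone code point has changed the representation of the whole string.
    print(sys.getsizeof(spoiled) > 3 * sys.getsizeof(ascii_text))  # True

Under a fixed UTF-8 or UTF-32 representation, appending one character
would change the size by a few bytes either way; under the PEP, it
reclassifies every character in the string.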
I may be wrong here, not least because I can think of possible
extenuating circumstances, but it is my impression that there is an
underlying assumption in the Python community and many others that being
able to access the Nth character in a string in constant time for
arbitrary N is the most important of all possible considerations.

I don't believe that makes as much sense as people think, because I
don't believe character strings really are accessed in that fashion very
often at all. Sure, if you have a 2D matrix of strings where a given
row-column pair yields one character and you're taking the diagonal, you
might want that, but how often does that actually happen? Virtually
never: these are strings, not matrices we're running FFTs on, after all.
We don't need to be able to load them into vector registers or anything
the way the number-crunching people do.

That's because strings are a sequence of characters: they're text.
Whether reading text left to right, right to left, or even
boustrophedonically, you're always going one past the character you're
currently at. You aren't going to the Nth character forward or back for
arbitrary N. That isn't how people deal with text. Sometimes they do
look at the end, or a few in from the far end, but even that can be
handled in other fashions.

I need to see firm use-case data justifying this overwhelming need for
O(1) access to the Nth character before I will believe it. I think it is
too rare to be as concerned with as so many people bizarrely appear to
be.

This attachment has serious consequences. It is because of this
attachment that the whole narrow/wide build thing occurs, where people
are willing to discard a clean, uniform processing model in search of
what I do not believe is a reasonable or realistic goal. Even if they
*were* correct, *far* more bugs are caused by unreliability than by
performance.

 * If you were truly concerned with memory use, you would simply use
   UTF-8.
 * If you were truly concerned with O(1) access time, you would always
   use UTF-32.

 * Anything that isn't one of these two is some sort of messy
   compromise.

I promise that nobody ever had a BMP vs non-BMP bug who was working with
either UTF-8 or UTF-32. This only happens with UTF-16 and UCS-2, which
have all the disadvantages of both UTF-8 and UTF-32 combined yet none of
the advantages of either. It's the worst of both worlds.

Because you're using UTF-16, you're already paying quite a bit of memory
for text processing in Python compared to doing so in Go or in Perl,
which are both UTF-8 languages. Since you're already used to paying
extra, what's so wrong with going to purely UTF-32? That too would solve
things.

UTF-8 is not the memory pig people allege it is on Asian text. Consider:
I saved the Tokyo Wikipedia page for each of these languages as NFC text
and generated the following table comparing them. I've grouped the
languages into Western Latin, Western non-Latin, and Eastern.

    Paras Lines Words  Graphs  Chars  UTF16   UTF8  8:16 16:8  Language

      519  1525  6300   43120  43147  86296  44023   51% 196%  English
      343   728  1202    8623   8650  17302   9173   53% 189%  Welsh
      541  1722  9013   57377  57404 114810  59345   52% 193%  Spanish
      529  1712  9690   63871  63898 127798  67016   52% 191%  French
      321   837  2442   18999  19026  38054  21148   56% 180%  Hungarian

      202   464   976    7140   7167  14336  11848   83% 121%  Greek
      348   937  2938   21439  21467  42936  36585   85% 117%  Russian

      355   788   613    6439   6466  12934  13754  106%  94%  Chinese, simplified
      209   419   243    2163   2190   4382   3331   76% 132%  Chinese, traditional
      461  1127  1030   25341  25368  50738  65636  129%  77%  Japanese
      410   925  2955   13942  13969  27940  29561  106%  95%  Korean

Where:

 * Paras is the number of blankline-separated text spans.
 * Lines is the number of linebreak-separated text spans.
 * Words is the number of whitespace-separated text spans.
 * Graphs is the number of Unicode extended grapheme clusters.
 * Chars is the number of Unicode code points.
 * UTF16 is how many bytes it takes up stored as UTF-16.
 * UTF8 is how many bytes it takes up stored as UTF-8.
 * 8:16 is the ratio of UTF-8 size to UTF-16 size as a percentage.
 * 16:8 is the ratio of UTF-16 size to UTF-8 size as a percentage.
 * Language is which version of the Tokyo page we're talking about here.

Here are my observations:

 * Western languages that use the Latin script suffer terribly upon
   conversion from UTF-8 to UTF-16, with English suffering the most by
   expanding by 96% and Hungarian the least by expanding by 80%. All are
   huge.

 * Western languages that do not use the Latin script still suffer, but
   only 15-20%.

 * Eastern languages DO NOT SUFFER in UTF-8 the way everyone claims that
   they do!

To expand on the last point:

 * In Korean and in (simplified) Chinese, you get only 6% bigger in
   UTF-8 than in UTF-16.

 * In Japanese, you get only 29% bigger in UTF-8 than in UTF-16.

 * The traditional Chinese actually got smaller in UTF-8 than in UTF-16!
   In fact, it costs 32% to use UTF-16 over UTF-8 for this sample. If
   you look at the Lines and Words columns, it looks like this might be
   due to whitespace usage.

So UTF-8 isn't even too bad on Asian languages. But you howl that it's
variable width. So? You're already using a variable-width encoding in
Python on narrow builds. I know you think otherwise, but I'll prove this
below.

Variable width isn't as bad as people claim, partly because fixed width
is not as good as they claim. Think of the kind of operations that you
normally do on strings. You want to go to the next user-visible
grapheme, or to the end of the current word, or to the start of the next
line. UTF-32 helps you with none of those, and UTF-8 does not hurt them.
You cannot instantly go to a particular address in memory for any of
those unless you build up a table of offsets, as some text editors
sometimes do, especially for lines. You simply have to parse it out as
you go.

Here's why I say that Python uses UTF-16, not UCS-2, on its narrow
builds.
Perhaps someone could tell me why the Python documentation says it uses
UCS-2 on a narrow build. I believe this to be an error, because
otherwise I cannot explain how you can have non-BMP code points in your
UTF-8 literals in your source code. And you clearly can:

    #!/usr/bin/env python3.2
    # -*- coding: UTF-8 -*-
    super = "𝔘𝔫𝔦𝔠𝔬𝔡𝔢"
    print(super)

This is with a narrow build on Darwin:

    % python3.2 -c 'import sys; print(sys.maxunicode)'
    65535
    % export PYTHONIOENCODING=UTF-8 PERLUNICODE=S
    % python3.2 supertest | uniwc
    Paras    Lines    Words   Graphs    Chars    Bytes File
        0        1        1        8        8       29 standard input
    % python3.2 supertest | uniquote -x
    \x{1D518}\x{1D52B}\x{1D526}\x{1D520}\x{1D52C}\x{1D521}\x{1D522}

Observations and conclusion:

 * You are emitting 8 code points, 7 in the SMP not in the BMP.
 * You clearly understand code points above your alleged maxunicode
   value.
 * If you were actually using UCS-2, those would not be possible.
 * I submit that this proves you are actually using UTF-16. Q.E.D.

Yet you are telling people you are using UCS-2. Why is that?

Since you are already using a variable-width encoding, why the
supercilious attitude toward UTF-8? UTF-16 has the same properties but
costs you a lot more. As I said before, UTF-16 puts you in the worst of
all worlds:

 * If you were truly concerned with memory use, you would simply use
   UTF-8.
 * If you were truly concerned with O(1) access time, you would always
   use UTF-32.
 * Anything that isn't one of these two is some sort of messy
   compromise.

But even with UTF-16 you *can* present to the user a view of logical
characters that doesn't care about the underlying clunkish
representation. The Java regex engine proves that, since "." always
matches a single code point no matter whether it is in the BMP or not.
Similarly, ICU's classes operate on logical characters -- code points,
not units -- even though they use UTF-16 internally. The Nth code point
does not care and should not care how many units it takes to get there.
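A short sketch makes the code-unit versus code-point distinction
concrete. It shows the surrogate pair a narrow build (or Java, or any
UTF-16 representation) stores for one SMP character; the final len()
result assumes a wide build, where a narrow build would instead report
2:

    s = "\U0001D518"        # MATHEMATICAL FRAKTUR CAPITAL U, one code point

    # As UTF-16, this single code point becomes two 16-bit code units,
    # a high and a low surrogate -- what a narrow build stores internally.
    units = s.encode("utf-16-be")
    print(len(units) // 2)  # 2 code units
    print(units.hex())      # d835dd18 -- high surrogate, low surrogate

    # A logical-character model counts code points, not units:
    print(len(s))           # 1 on a wide build; a narrow build says 2

That last line is precisely the difference between a code-point
interface and a code-unit interface leaking through to the user.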
It is fine to have both a byte interface *and* a character interface,
but I don't believe having something that falls in between those two is
of any use whatsoever. And if you don't have a code point interface, you
don't have a character interface. This is my biggest underlying
complaint about Python's string model, but I believe it fixable, even if
doing so exceeds my own personal background.

--tom

----------
_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12729>
_______________________________________