P.J. Eby writes:

> I know, it's a hard thing to wrap one's head around, since on the
> surface it sounds like unicode is the programmer's savior.
I don't need to wrap my head around it.  It's been deeply embedded,
point first, and the nasty barbs ensure that I have no desire to pull
it back out.  To wit, I've been dealing with Japanese encoding issues
on a daily basis for 20 years, and I'm well aware that programmers
have several good reasons (and a lot more bad ones) for avoiding
them, and even for avoiding Unicode when they must deal with
encodings at all.  I don't think any of the good reasons have been
offered here yet, that's all.

> Unfortunately, real-world text data exists which cannot be safely
> roundtripped to unicode, and must be handled in "bytes with
> encoding" form for certain operations.

Or in "Unicode with encoding" form.  See below for why this makes
sense in the context of Python.

> I personally do not have to deal with this *particular* use case
> any more -- I haven't been at NTT/Verio for six years now.

As mentioned, I have a bit of understanding of the specific problems
of Japanese-language computing.  In particular, roundtripping
Japanese from *any* encoding to *any other* encoding is problematic,
because the national standards provide only a proper subset of the
repertoire actually used by the Japanese people.  (Even JIS X 0213.)

> My current needs are simpler, thank goodness. ;-)  However, they
> *do* involve situations where I'm dealing with *other*
> encoding-restricted legacy systems, such as software for
> interfacing with the US Postal Service that only works with a
> restricted subset of latin1, while receiving mangled ASCII from an
> ecommerce provider, and storing things in what's effectively a
> latin-1 database.

Yes, I know of similar issues in other applications.  For example,
TeX error messages do not respect UTF-8 character boundaries, so
Emacs has to handle them specially (basically, a mechanism similar in
spirit to PEP 383 is used).

> Being able to easily assert what kind of bytes I've got would
> actually let me catch errors sooner, *if* those assertions were
> being checked when different kinds of strings or bytes were being
> combined (i.e., at coercion time).

I can see that this would make life a little easier for you, letting
you maintain the code without refactoring it.  I'd say it's a kludge,
but without a full list of requirements I'm in no position to claim
any authority <wink>.  E.g., for a non-kludgey suggestion, how about
defining a codec which takes Latin-1 bytes, checks (with an error on
failure) for the restricted subset, and converts to str?  (A sketch
of such a codec follows below.)  Then you can manipulate these things
as str with abandon internally.  Finally, you get another check in
the outgoing codec, which converts from str to "effective Latin-1
bytes", however that is defined.

But OK, maybe I'm just being naive, and you need this unlovely
artifice so you can put asserts in appropriate places.  Now, does it
belong in the stdlib?

It seems to me that in the case of Japanese roundtripping, *most* of
the time encoding back to a standard Japanese encoding will work.  If
you run into one of the problematic characters that JIS doesn't allow
but Japanese like to use, because they prefer that glyph to the
JIS-standard one, you get an occasional error on encoding to a
standard Japanese encoding, which you handle specially with a
database of such characters.  Knowing the specific encoding
originally used normally does *not* help unless you're replying to
that person, and *only* that person, because the extended repertoires
vary widely and the only standard they share is JIS.  I conclude that
ebytes does *no* good here.
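Concretely, here's a minimal sketch of the kind of codec I mean.
(The ALLOWED repertoire, the codec name 'usps_latin_1', and the use
of ValueError are placeholders of my own, not anything
USPS-specific.)

    import codecs

    # Hypothetical restricted repertoire: printable ASCII plus a few
    # Latin-1 letters the downstream system is assumed to accept.
    ALLOWED = set(range(0x20, 0x7F)) | {0xC9, 0xE9, 0xF1}

    def _check(raw):
        for i, b in enumerate(raw):
            if b not in ALLOWED:
                raise ValueError('byte %#04x at offset %d is outside '
                                 'the repertoire' % (b, i))

    def _decode(data, errors='strict'):
        raw = bytes(data)
        _check(raw)                      # check on the way in
        return raw.decode('latin-1', errors), len(raw)

    def _encode(text, errors='strict'):
        raw = text.encode('latin-1', errors)
        _check(raw)                      # and again on the way out
        return raw, len(text)

    def _search(name):
        if name == 'usps_latin_1':
            return codecs.CodecInfo(_encode, _decode, name=name)
        return None

    codecs.register(_search)

    street = b'123 Main St'.decode('usps_latin_1')  # OK; str from here on
    try:
        b'sm\xf8rrebr\xf8d'.decode('usps_latin_1')
    except ValueError as e:
        print(e)                         # 0xf8 is outside the repertoire

The point is that the repertoire check lives at the boundary, in the
codec; everything inside the program is plain str, and there is
nothing left to assert at coercion time.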
For the ecommerce/USPS case, well, actually you need special-purpose
encodings anyway (ISTM).  'latin-1' loses: the USPS is allergic to
some valid 'latin-1' characters.  'ascii' loses: apparently you need
some of the Latin-1 repertoire, and anyway AIUI the ecommerce
provider munges the ASCII.  So what does ebytes actually buy you
here, unless you write the codecs?  And if you've got the codecs,
what additional benefit do you get from ebytes?  Note that you would
*also* need to do explicit transcoding anyway if you were dealing
with Japan Post instead of the USPS, although I grant that your code
is probably general enough to deal with Deutsche Telekom (but the
German equivalent of your ecommerce provider probably has its own
ways of munging Latin-1).  I conclude that there may be genuine
benefits to ebytes here, but they're probably not general enough to
put in the stdlib (or the Python language).

> Which works if and only if your outputs are truly unicode-able.

With PEP 383, they always are, as long as you allow the str to be
encoded back to the same garbage your bytes-based program would have
produced anyway.

> If you work with legacy systems (e.g. those Asian email clients and
> US postal software), you are really working with a *character set*,
> not unicode,

I think you're missing something.  Namely, Unicode is a standard for
handling characters as integers, and a registry for mapping
characters to those integers.  It includes over 100,000 private-use
code points for making up your own mappings, and recent Python also
provides (as an internal extension) a way to embed non-characters in
a str.  Unicode does not impose a repertoire, however.  That's up to
the application, and Python provides a convenient way to restrict
repertoires by defining special-purpose codecs in Python.

It is then up to the program to ensure that all candidates claiming
to be text pass through the cleansing fire of a codec before being
allowed into the Pure Land of str.  This can be something of a
problem; there are a few ways for textual data to get into Python,
and not all of them were obvious to me.  But this problem would be
even worse for mechanisms like ebytes, where it's up to the
programmer to decide which things are put into ebytes.

> and so putting your data in unicode form is actually *wrong* -- an
> expedient lie.
>
> Heresy, I know, but there you go. ;-)

It's not heresy; it simply assumes a restriction on the use of
Unicode that just isn't true.  It *is* true that mapping the data to
Unicode according to some encoding is not always sufficient, and it
*is* often the case that further information must be provided to
ensure semantic correctness.  However, given the mapping (== properly
defined codecs), roundtripping *is* always possible, at least up to
the size of private space, which is certainly big enough to hold the
Post Office's repertoire.  And that mapping is a Python object which
will fit into a variable for later use.
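To make the PEP 383 claim above concrete, here is the roundtrip in
miniature (Python 3.1+; the byte string is just an arbitrary example
of non-UTF-8 garbage):

    raw = b'caf\xe9 \xff'              # not valid UTF-8
    s = raw.decode('utf-8', errors='surrogateescape')
    assert s.encode('utf-8', errors='surrogateescape') == raw

Each undecodable byte comes back as a lone surrogate, which encodes
back to exactly the byte it came from.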
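And a sketch of the private-space trick, written as a codec error
handler.  (The handler name 'pua-escape' and the plane 15 base are my
own choices; a real gaiji application would use a curated mapping
table rather than this byte-for-byte fallback.)

    import codecs

    PUA_BASE = 0xF0000   # start of the plane 15 Private Use Area

    def pua_escape(exc):
        # Undecodable bytes become private-use characters ...
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(PUA_BASE + b) for b in bad), exc.end
        # ... and (assuming only such characters fail to encode)
        # they encode back to the original bytes.
        if isinstance(exc, UnicodeEncodeError):
            chars = exc.object[exc.start:exc.end]
            return bytes(ord(c) - PUA_BASE for c in chars), exc.end
        raise exc

    codecs.register_error('pua-escape', pua_escape)

    raw = b'\x93\xfa\x96\x7b\x80'      # Shift JIS plus a stray 0x80
    s = raw.decode('shift_jis', errors='pua-escape')
    assert s.encode('shift_jis', errors='pua-escape') == raw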