On 08/28/2009 02:12 AM, "Martin v. Löwis" wrote:

[I reordered the quotes from your previous post to try and get the responses in a more coherent order. No intent to take anything out of context...]

>> Nothing else in the PEP seems remotely relevant.
[to providing justification for the behavior of unichr/ord]
>
> Except for the motivation, of course :-)
>
> In addition: your original question was "why has this
> been changed", to which the answer is "it hasn't".

My original interest was two-fold. First, can unichr/ord be changed to work in a more general and helpful way? That seemed remotely possible until it was pointed out that the two behave consistently, and that the behavior is accurately documented. Second, why would they work the way they do when they could have been generalized to cover the full Unicode space? An inadequate answer to this would have provided support for the first point, but it remains interesting to me for the reason below.

> Then, the next question is "why is it implemented that
> way", to which the answer is "because the PEP says so".

Not at all a satisfying answer unless one believes in PEPal infallibility. :-)

> Only *then* the question is "what is the rationale for
> the PEP specifying things the way it does". The PEP is
> relevant so that we can both agree that Python behaves
> correctly (in the sense of behaving as specified).

But my question had become: why that behavior, when a slightly different behavior would be more general with little apparent downside?

To clarify, my interest in the justification for the current behavior is this:

I think the best feature of Python is not, as commonly stated, the clean syntax, but rather the pretty complete and orthogonal libraries. I often find, after I have written some code, that because the right library functions were available, it turns out much shorter and more concise than I expected.

Nevertheless, every now and then, perhaps more than in some other languages (I'm not sure), I run into something that requires what seems to be excessive coding -- I have to do something that, it seems to me, a library function should have done for me.

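For unichr/ord on a narrow build, the extra coding I have in mind is roughly the sketch below. It is only an illustration of the standard UTF-16 surrogate arithmetic, not anything taken from the PEP or from Python's source, and the names to_surrogate_pair and from_surrogate_pair are made up for this post. It works on plain integers so that it runs on any build; on a narrow build one would presumably wrap it with unichr() on each half, or ord() on each character of the pair.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 code units.

    Hypothetical helper: roughly what a generalized unichr() would have
    to do on a narrow build before building a two-character string.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000              # 20 significant bits
    high = 0xD800 + (offset >> 10)             # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)            # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point.

    Hypothetical helper: roughly what a generalized ord() would have
    to do when handed a two-character string on a narrow build.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second unit is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))                     # 0xd834 0xdd1e
print(hex(from_surrogate_pair(high, low)))     # 0x1d11e
```

None of this is difficult, but it is exactly the kind of thing I would expect the library to do for me.
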
Sometimes the extra coding is needed because I don't understand the reason the library function needs to work the way it does. Other times it is one of the countless trade-offs made in the design of the language, which didn't happen to go the way that would have been beneficial to me in a particular coding situation.

But sometimes (and it feels too often) it seems as though, Zen notwithstanding, purity -- adherence to some philosophic ideal -- beat practicality. unichr/ord seems such a case to me, but I want to be sure I am not missing something.

The reasons for the current behavior so far:

1.
> What you propose would break the property "unichr(i) always returns
> a string of length one, if it returns anything at all".

Yes. And I don't see the problem with that. Why is that property more desirable than the non-existent property that a Unicode literal always produces one Python character? The property would only be broken on a narrow build with a Unicode character outside of the BMP, exactly the condition under which a Unicode literal can "behave differently" by producing two Python characters.

2.
>> But there is no reason given [in the PEP] for that behavior.
> Sure there is, right above the list:
> "Most things will behave identically in the wide and narrow worlds."
> That's the reason: scripts should work the same as much as possible
> in wide and narrow builds.

So what else would work "differently"? My point was that extending unichr/ord to work with all Unicode characters reduces differences far more often than it increases them.

3.
>> * There is a convention in the Unicode world for
>> encoding a 32-bit code point in terms of two
>> 16-bit code points. These are known as
>> "surrogate pairs". Python's codecs will adopt
>> this convention.
>>
>> Is a distinction made between Python and Python
>> codecs with only the latter having any knowledge of
>> surrogate pairs?
>
> No. In the end, the Unicode type represents code units,
> not code points, i.e. half surrogates are individually
> addressable. Codecs need to adjust to that; in particular
> the UTF-8 and the UTF-32 codec in narrow builds, and the
> UTF-16 codec in wide builds (which didn't exist when the
> PEP was written).

OK, so that is not a reason either.

4. I'll speculate a little. If surrogate handling were added to ord/unichr, it would be the top of a slippery slope leading to demands that other string functions also handle surrogates.

But this is not true -- there is a strong distinction between ord/unichr and the other string methods. The latter deal with strings of multiple characters, but the former deal only with single characters (taking a surrogate pair as a single Unicode character). The behavior of ord/unichr is independent of the other string methods: if those methods were changed with regard to surrogate handling, they would all have to be changed together to maintain consistent behavior, whereas unichr/ord affect only each other.

The functions of ord/unichr -- to map characters to numbers -- are fundamental string operations, akin to indexing or extracting a substring. So why would one want to limit them to a subset of characters if not absolutely necessary?

To reiterate, I am not advocating for any change. I simply want to understand if there is a good reason for limiting the use of unichr/ord on narrow builds to a subset of the Unicode characters that Python otherwise supports. So far, it seems not, and that unichr/ord is a poster child for "purity beats practicality".

--
http://mail.python.org/mailman/listinfo/python-list