Dan Dibagh <[EMAIL PROTECTED]> added the comment:

> Which PEP specifically? PEP 263 only mentions the unicode-escape
> encoding in its problem statement, i.e. as a pre-existing thing.
> It doesn't specify it, nor does it give a rationale for why it behaves
> the way it does.
PEP 100 and PEP 263. What I looked for was a description of the functional
intention and a technical definition of raw unicode escape. The term "raw"
tends to have different meanings depending on the context in which it
appears. PEP 263 is of interest for the overall understanding of the
intention of raw unicode escape: if raw unicode escape is meant to convert
from Python source into unicode strings, then the decoding of raw unicode
escape strings depends on the source code encoding. That, in turn, might
give an idea of what the encoding part is supposed to do... PEP 100 is of
interest for the technical description; its section "Unicode Constructors"
is the closest thing to a definition.

> What code are you looking at, and where do you find it difficult to
> follow it? Maybe you get confused between the "unicode-escape" codec,
> and the "raw-unicode-escape" codec, also.

Since the issue I am looking at concerns non-ASCII characters in pickle
output, raw-unicode-escape is the focus. For the decoding part, the
distinction between unicode-escape and raw-unicode-escape is very clear.
I am looking at the function PyUnicode_EncodeRawUnicodeEscape in
Objects/unicodeobject.c. At the point of the comment "/* Copy everything
else as-is */", given the perceived intention of the encoding, I try to
figure out why there isn't a "/* Map non-printable US ASCII to '\xhh' */"
section like in the unicodeescape_string function. The background in older
Pythons that you explained is essentially what I guessed.

> The raw-unicode-escape codec? It was designed to support parsing of
> Python 2.0 source code, and of "raw" unicode strings (ur"") in
> particular. In Python 2.0, you only needed to escape characters above
> U+0100; Latin-1 characters didn't need escaping. Python, itself, only
> relied on the decoding direction. That the codec chooses not to escape
> Latin-1 characters on encoding is an arbitrary choice (I guess); it's
> still symmetric with decoding.
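To make the difference concrete before replying: here is how the two
codecs compare on the encoding side (Python 3 syntax shown for brevity;
2.x behaves the same for u'' strings -- this is just an illustration, not
part of any patch):

```python
# A string with a non-printable ASCII character (U+0001), a Latin-1
# character (U+00E9) and a character outside Latin-1 (U+0100).
s = "\x01\xe9\u0100"

# unicode-escape maps non-printable ASCII and Latin-1 to \xhh escapes,
# so its output is pure ASCII:
print(s.encode("unicode_escape"))      # b'\\x01\\xe9\\u0100'

# raw-unicode-escape copies everything below U+0100 as-is; there is no
# "Map non-printable US ASCII to '\xhh'" step, hence the non-ASCII bytes:
print(s.encode("raw_unicode_escape"))  # b'\x01\xe9\\u0100'

# Decoding the raw-unicode-escape output gives back the original string,
# which is the symmetry referred to below:
assert s.encode("raw_unicode_escape").decode("raw_unicode_escape") == s
```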
I suppose you mean symmetric with decoding as long as you stick to the
Latin-1 character set, since raw unicode escaping isn't a one-to-one
mapping. When PEP 263 came into the picture, wouldn't it have made sense
to change PyUnicode_EncodeRawUnicodeEscape to produce ASCII-only output,
or perhaps output conforming to the current default encoding? Given the
intention of raw unicode escape, encoding something with it means
producing Python source code. But its output is Latin-1, while the rest of
Python has moved on to ASCII by default, or whatever encoding is declared
in the source. I tried to shed light on that problem in my previous
example.

> Even though the choice was arbitrary, you shouldn't change it now,
> because people may rely on how this codec works.
> Applications might rely on what was implemented rather than what was
> specified. If they had implemented their own pickle readers, such
> readers might break if the pickle format is changed. In principle,
> even the old pickle readers of Python 2.0..2.6 might break if the
> format changes in 2.7 - we would have to go back and check that they
> don't break (although I do believe that they would work fine).

Then let me ask: how far does the aim to maintain compatibility with
programs that depend on Python internals reach? Even if the internal
behavior is a bug, and the thing which depends on it is also a bug? Maybe
that is a provocative question, so let me explain. The questions apply to
some extent to the workings of the codec, but it is really the pickle
problem I am thinking of. In the case of older Python releases, it is just
a matter of testing, as you say. It is boring and perhaps tedious, but
nothing special prevents it from being done. If there are many versions,
there ought to be a way to write a program which does it automatically.
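The ASCII-only change I have in mind could be sketched like this (a
hypothetical helper, not the actual patch; the name and the exact escape
policy are my own assumptions). The point is that its output still decodes
identically with the existing codec:

```python
def raw_unicode_escape_ascii(s):
    """Hypothetical ASCII-only variant of raw-unicode-escape encoding.

    Escapes every character above U+007F as \\uxxxx (or \\Uxxxxxxxx for
    non-BMP characters) instead of passing Latin-1 through as-is.  Like
    the existing codec, it does NOT escape backslashes; callers such as
    pickle must pre-escape '\\' themselves.
    """
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out.append(chr(cp))          # plain ASCII: copy as-is
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)   # BMP: 4-digit escape
        else:
            out.append("\\U%08x" % cp)   # non-BMP: 8-digit escape
    return "".join(out).encode("ascii")

# The existing decoder accepts both spellings, so nothing that decodes
# raw-unicode-escape correctly can tell the difference:
s = "caf\xe9 \u0100"
assert raw_unicode_escape_ascii(s) == b"caf\\u00e9 \\u0100"
assert raw_unicode_escape_ascii(s).decode("raw_unicode_escape") == s
```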
In the case of those who have implemented their own pickle readers, the
source and the comments in pickletools.py clearly state that unicode
strings are raw-unicode-escaped in format 0. Now, raw unicode escape isn't
a canonical format. The letter A can be represented either as \u0041 or as
itself, A. If a hypothetical implementor gets the idea that characters in
the range 0-255 cannot be represented by \u00xx sequences, then the fact
that pickle replaces \ with \u005c and \n with \u000a should give a hint
that he is wrong. So if characters in the range 128-255 get escaped with
\u00xx, any pickle reader should handle it. I've tried to come up with
some sensible way to write a pickle implementation which fails to
understand \u00xx characters without calling it a bug. I cannot. Can you?
So it seems that the worry about changing protocol 0 comes down to buggy
programs depending on a pickle bug. At the other end of the spectrum there
are correct programs which depend on Python externals, i.e. programs
depending on ASCII-conformant pickle output (even if there are some base64
...ehm... fundamentalists who think it is the wrong way to do it -- I can
think of at least one good reason to do it).

> So contributions are welcome. If you find that the patch meets
> resistance, you also need to write a PEP, and ask for BDFL
> pronouncement.

I am considering doing a patch. I also understand that in order for the
patch to gain acceptance it must fit into the Python framework. That's why
I ask all these questions.

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue2980>
_______________________________________
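[Appendix: a concrete check of the non-canonical-format argument above.
The 'V' opcode carries one raw-unicode-escaped line, so handcrafted
protocol 0 streams with and without \u00xx escapes must load to the same
string; a reader that chokes on \u00xx is simply buggy. Python 3 shown;
the opcodes are the same ones the 2.x thread discusses.]

```python
import pickle

# Protocol 0 stores a unicode string as the 'V' opcode, the
# raw-unicode-escaped text, a newline, and '.' (STOP).  Both spellings
# of 'A' -- literal and \u0041-escaped -- are valid input:
assert pickle.loads(b"VA\n.") == "A"
assert pickle.loads(b"V\\u0041\n.") == "A"

# Likewise U+00E9 may appear either as a raw Latin-1 byte (which is why
# protocol 0 output can contain non-ASCII bytes) or fully escaped:
assert pickle.loads(b"V\xe9\n.") == "\xe9"
assert pickle.loads(b"V\\u00e9\n.") == "\xe9"

# pickle itself already pre-escapes '\\' and '\n' as \u005c and \u000a,
# because raw-unicode-escape alone would leave the stream ambiguous:
data = pickle.dumps("a\\b\nc", protocol=0)
assert b"\\u005c" in data and b"\\u000a" in data
assert pickle.loads(data) == "a\\b\nc"
```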