Can you please join the Python WEB-SIG and continue the existing conversation there.
http://groups.google.com/group/python-web-sig?lnk=

At the time I was merely facilitating a discussion and am not an expert on the issues.

I have cc'd the web-sig for those who still may be interested in this.

Graham

---------- Forwarded message ----------
From: Terry Reedy <tjre...@udel.edu>
Date: 21 June 2010 13:56
Subject: Re: bytes / unicode
To:
Cc: graham.dumple...@gmail.com

On 6/20/2010 9:33 PM, P.J. Eby wrote:
>
> At 07:33 PM 6/20/2010 -0400, Terry Reedy wrote:
>>
>> Do you have in mind any tools that could and should operate on both,
>> but do not?
>
> From http://mail.python.org/pipermail/web-sig/2009-September/004105.html :

Thanks for the concrete examples in this and your other post. I am cc-ing the author of the above.

> """The problem which arises is that unquoting of URLs in Python 3.X
> stdlib can only be done on unicode strings.

Actually, I believe this is an encoding issue rather than a bytes versus unicode issue.

> If though a string
> contains non UTF-8 encoded characters it can fail."""

Which is to say, I believe, if the ascii text in the (unicode) string has a % encoding of a byte that is not a legal utf-8 encoding of anything. The specific example is

    >>> urllib.parse.parse_qsl('a=b%e0')
    [('a', 'b�')]

where the character after 'b' is a white ? in a dark diamond, indicating an error.

parse_qsl() splits that input on '=' and sends each piece to urllib.parse.unquote. unquote() attempts to "Replace %xx escapes by their single-character equivalent.". unquote has an encoding parameter that defaults to 'utf-8' in *its* call to .decode. parse_qsl does not have an encoding parameter. If it did, and it passed that to unquote, then the above example would become (simulated interaction)

    >>> urllib.parse.parse_qsl('a=b%e0', encoding='latin-1')
    [('a', 'bà')]

I got that output by copying the file and adding "encoding='latin-1'" to the unquote call.

Does this solve this problem? Has anything like this been added for 3.2? Should it be?
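For readers on later versions, the simulated interaction above can be tried directly. This is a minimal sketch, assuming Python 3.2 or later, where (as far as I know) parse_qsl accepts encoding and errors arguments and forwards them to unquote. The variable names are only for illustration.

```python
from urllib.parse import parse_qsl, unquote

# %e0 is not a valid UTF-8 sequence on its own, so the default decoding
# (encoding='utf-8', errors='replace') substitutes U+FFFD.
default_value = unquote('b%e0')

# Decoding the same escape as Latin-1 yields U+00E0 (a with grave accent).
latin1_value = unquote('b%e0', encoding='latin-1')

# Assumption: Python 3.2 or later, where parse_qsl takes an encoding
# argument and passes it through to unquote.
default_pairs = parse_qsl('a=b%e0')
latin1_pairs = parse_qsl('a=b%e0', encoding='latin-1')

print(default_pairs)
print(latin1_pairs)
```

The caller still has to know which encoding the client used for the escaped bytes. The parameter only makes that choice expressible.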
> I don't have any direct experience with the specific issue demonstrated
> in that post, but in the context of the discussion as a whole, I
> understood the overall issue as "if you pass bytes to certain stdlib
> functions, you might get back unicode, an explicit error, or (at least
> in the case shown above) something that's just plain wrong."

As indicated above, I so far think that the problem is with the application of the new model, not the model itself.

Just for 'fun', I tried feeding bytes to the function.

    >>> p.parse_qsl(b'a=b%e0')
    Traceback (most recent call last):
      File "<pyshell#2>", line 1, in <module>
        p.parse_qsl(b'a=b%e0')
      File "C:\Programs\Python31\lib\urllib\parse.py", line 377, in parse_qsl
        pairs = [s2 for s1 in qs.split('&') for s2 in s1.split(';')]
    TypeError: Type str doesn't support the buffer API

I do not know if that message is correct, but certainly trying to split bytes with unicode is (now, at least) a mistake. This could be 'fixed' by replacing the typed literals with expressions that match the type of the input. But I am not sure if that is sensible since the next step is to unquote and decode to unicode anyway. I just do not know the use case.

Terry Jan Reedy
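To make the 'fix' mentioned above concrete, here is a rough sketch of choosing separators that match the type of the input. split_pairs is a hypothetical helper written for this illustration, not the stdlib code, and it covers only the splitting step from the traceback, not unquoting or decoding.

```python
def split_pairs(qs):
    """Hypothetical helper: split a query string on '&' and ';' using
    separators of the same type (bytes or str) as the input."""
    if isinstance(qs, bytes):
        amp, semi = b'&', b';'
    else:
        amp, semi = '&', ';'
    return [s2 for s1 in qs.split(amp) for s2 in s1.split(semi)]


# Splitting bytes with a str separator raises TypeError, which is the
# failure shown in the traceback (the exact message varies by version).
try:
    b'a=b%e0'.split('&')
    mixed_error = None
except TypeError as exc:
    mixed_error = exc

print(split_pairs(b'a=1&b=2;c=3'))
print(split_pairs('a=1&b=2;c=3'))
```

Whether the pieces should then stay as bytes or be decoded is the open use-case question raised above, and this sketch does not settle it.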