Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Matt Giuca
Wow .. a lot of replies today! On Thu, Aug 7, 2008 at 2:09 AM, Martin v. Löwis wrote: It hasn't been given priority: There are currently 606 patches in the tracker, many fixing bugs of some sort. It's not clear (to me, at least) why this should be given priority over all the

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-07 Thread Guido van Rossum
FWIW, the rest of this discussion is now happening in the tracker: http://bugs.python.org/issue3300. We could really use some feedback from Python users in Asian countries. --Guido van Rossum (home page: http://www.python.org/~guido/)

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread André Malo
* Bill Janssen wrote: I'm far less concerned about the decision with regards to unquote_to_bytes/quote_from_bytes, as those are new features which can wait. Forgive me, but those are the *old* features, which must be there. This whole discussion circles too much, I think. Maybe it

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Matt Giuca
This whole discussion circles too much, I think. Maybe it should be pepped? The issue isn't circular. It's been patched and tested, then a whole lot of people agreed including Guido. Then you and Bill wanted the bytes functionality back. So I wrote that in there too, and Bill at least said that

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread André Malo
* Matt Giuca wrote: This whole discussion circles too much, I think. Maybe it should be pepped? The issue isn't circular. It's been patched and tested, then a whole lot of people agreed including Guido. Then you and Bill wanted the bytes functionality back. So I wrote that in there too,

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Matt Giuca
There are a lot of quotes around. Including After the most recent flurry of discussion I've lost track of what's the right thing to do. But I don't talk for other people. OK .. let me compose myself a little. Sorry I went ahead and assumed this was closed. It's just frustrating to me that

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Scott Dial
André Malo wrote: * Matt Giuca wrote: We've reached, to quote Guido, as close as consensus as we can get on this issue. There are a lot of quotes around. Including After the most recent flurry of discussion I've lost track of what's the right thing to do. But I don't talk for other people.

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Bill Janssen
I suggest we continue this discussion, if at all, on the bug-tracker, where there's code, and more participants. http://bugs.python.org/issue3300 I've now posted my idea of how quote/unquote should work in py3K, there. Bill

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Martin v. Löwis
Nobody's been assigned to look at it and it hasn't been given a priority, even though we all agree it's a bug (though we disagree on how to fix it). This I can explain (I think). Nobody is assigned to look: we usually don't do assignments of bugs or patches, except when there is a specific

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Antoine Pitrou
Martin v. Löwis <martin at v.loewis.de> writes: URLs are just not made for non-ASCII characters. Perhaps they are not, but every non-English wiki (just to take a simple, generic example) potentially contains non-ASCII URLs. e.g. http://fr.wikipedia.org/wiki/%C3%89l%C3%A9phant
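
The Wikipedia path quoted above makes the stakes concrete. A minimal sketch, assuming the urllib.parse.unquote that eventually shipped in Python 3 (UTF-8 by default, with an optional encoding argument), which is not necessarily the code under review in this thread:

```python
from urllib.parse import unquote

path = "%C3%89l%C3%A9phant"  # path component from the URL quoted above

# Decoding the percent-encoded octets as UTF-8 recovers the page title.
as_utf8 = unquote(path)

# Decoding the same octets as Latin-1 yields mojibake instead.
as_latin1 = unquote(path, encoding="latin-1")

print(as_utf8)          # Éléphant
print(repr(as_latin1))  # 'Ã\x89lÃ©phant'
```

The same four escaped octets give two different strings, which is why the choice of default encoding matters so much here.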

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Martin v. Löwis
Implement IRIs if you want non-ASCII characters; the rules are much clearer for these. I think most people would expect something which works with the current World Wide Web rather than a rigorous implementation of a specific RFC. Implementing RFCs is fine but it does not magically

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread M.-A. Lemburg
On 2008-08-06 18:55, Antoine Pitrou wrote: Martin v. Löwis <martin at v.loewis.de> writes: URLs are just not made for non-ASCII characters. Perhaps they are not, but every non-English wiki (just to take a simple, generic example) potentially contains non-ASCII URLs. e.g.

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-06 Thread Guido van Rossum
On Wed, Aug 6, 2008 at 9:09 AM, Martin v. Löwis wrote: Nobody's been assigned to look at it and it hasn't been given a priority, even though we all agree it's a bug (though we disagree on how to fix it). This I can explain (I think). Nobody is assigned to look: we usually

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-05 Thread Matt Giuca
Has anyone had time to look at the patch for this issue? It got a lot of support about a week ago, but nobody has replied since then, and the patch still hasn't been assigned to anybody or given a priority. I hope I've complied with all the patch submission procedures. Please let me know if there

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-05 Thread Guido van Rossum
After the most recent flurry of discussion I've lost track of what's the right thing to do. I also believe it was said it should wait until 2.7/3.0, so there's no hurry (in fact there's no way to check it -- we don't have branches for those versions yet). On Tue, Aug 5, 2008 at 5:47 AM, Matt

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-05 Thread Matt Giuca
After the most recent flurry of discussion I've lost track of what's the right thing to do. I also believe it was said it should wait until 2.7/3.0, so there's no hurry (in fact there's no way to check it -- we don't have branches for those versions yet). I assume you mean 2.7/3.1. I've

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-08-05 Thread Bill Janssen
I'm far less concerned about the decision with regards to unquote_to_bytes/quote_from_bytes, as those are new features which can wait. Forgive me, but those are the *old* features, which must be there. Bill

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Stephen J. Turnbull
Matt Giuca writes: OK, for all the people who say URI encoding does not encode characters: yes it does. This is not an encoding for binary data, it's an encoding for character data, but it's unspecified how the strings map to octets before being percent-encoded. In other words, it's an

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Bill Janssen
Guido says: Actually, we'd need to look at the various other APIs in Py3k before we can decide whether these should be considered taking or returning bytes or text. It looks like all other APIs in the Py3k version of urllib treat URLs as text. Yes, as I said in the bug tracker,

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Bill Janssen
Of course, it's un-Pythonic to enforce pedantry, and we pedants can use a string-string encoder correctly. Sure. All I was asking was that we not break the existing usage of the standard library unquote by producing a string by *assuming* a UTF-8 encoded string is what's in those

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Bill Janssen
Also see http://en.wikipedia.org/wiki/Percent-encoding. Bill ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Stephen J. Turnbull
Bill Janssen writes: A quoting function that accepts bytes *must* have an encoding argument. Huh? What would it use it for? Ah, you're right. I was thinking in terms of an URI builder, where the quoter would do any required conversion (eg, if the bytes represented a string in

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca
Alright, I've uploaded the new patch which adds the two requested bytes-oriented functions, as well as accompanying docs and tests. http://bugs.python.org/issue3300 http://bugs.python.org/file11009/parse.py.patch6 I'd rather have two pairs of functions, so that those who want to give the readers

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca
Bill wrote: I'm not sure that's sufficient review, though I agree it's necessary. The major consumers of quote/unquote are not in the Python standard library. I figured that Python 3.0 is designed to fix things, with the breaking third-party code being an acceptable side-effect of that. So

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Jeff Hall
quote_from_bytes = quote So either name can be used on either input type, with the idea being that you should use quote on a str, and quote_from_bytes on a bytes. Is this a good idea or should it be rewritten so each function permits only one input type? so you can use quote_from_bytes

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-31 Thread Matt Giuca
so you can use quote_from_bytes on strings? Yes, currently. I assumed Guido meant it was okay to have quote accept string/byte input and have a function that was redundant but limited in what it accepted (i.e. quote_from_bytes accepts only bytes) I suppose your implementation doesn't
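
For comparison, a small sketch of the stricter arrangement raised above, assuming the urllib.parse functions as they ship in current Python 3 (quote accepts str or bytes, quote_from_bytes accepts bytes only). The patch discussed in this thread may have behaved differently at this point:

```python
from urllib.parse import quote, quote_from_bytes

# quote() takes text (encoded as UTF-8 by default) or raw bytes.
text_quoted = quote("é")
bytes_quoted = quote(b"\xc3\xa9")

# quote_from_bytes() is the bytes-only entry point.
strict_quoted = quote_from_bytes(b"\xc3\xa9")

# Passing a str to the bytes-only function is rejected.
try:
    quote_from_bytes("é")
    rejected = False
except TypeError:
    rejected = True

print(text_quoted, bytes_quoted, strict_quoted, rejected)
```

All three calls produce %C3%A9, and only the general-purpose quote tolerates both input types.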

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Matt Giuca
Hi folks, This issue got some attention a few weeks back but it seems to have fallen quiet, and I haven't had a good chance to sit down and reply again till now. As I've said before this is a serious issue which will affect a great deal of code. However it's obviously not as clear-cut as I

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Matt Giuca
Arg! Damnit, why do my replies get split off from the main thread? Sorry about any confusion this may be causing.

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Oleg Broytmann
On Thu, Jul 31, 2008 at 12:11:40AM +1000, Matt Giuca wrote: 2. Default to UTF-8. In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven Count me too: +1. Most sites I use these days use UTF-8 for URL encoding. Examples: Wikipedia:

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Facundo Batista
2008/7/30 Matt Giuca: 2. Default to UTF-8. In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven Pros: Fully working and tested solution is implemented; recommended by RFC 3986 for all future schemes; recommended by W3C for use with HTML; UTF-8 used by all

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Antoine Pitrou
Facundo Batista <facundobatista at gmail.com> writes: 2008/7/30 Matt Giuca <matt.giuca at gmail.com>: 2. Default to UTF-8. In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven Pros: Fully working and tested solution is implemented; recommended by RFC 3986 for all future

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread André Malo
[I was pretty busy these days, so sorry for jumping in late again] * Matt Giuca wrote: 1. Leave it as it is. quote is Latin-1 if range(0,256), fallback to UTF-8. unquote is Latin-1. In favour: Anybody who doesn't reply to this thread Pros: Already implemented; some existing code depends
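
Option 1 as summarised above (quote uses Latin-1 for code points in range(0,256) and falls back to UTF-8, unquote is Latin-1) can be sketched roughly as follows. The names legacy_quote and legacy_unquote are hypothetical, and the per-character fallback is an assumption about how that description should be read, not a copy of the actual py3k code:

```python
from urllib.parse import quote_from_bytes, unquote

def legacy_quote(s):
    """Hypothetical emulation of the 'option 1' quote described above."""
    out = []
    for ch in s:
        if ord(ch) < 256:
            # Assumed: code points below 256 become one Latin-1 octet.
            out.append(quote_from_bytes(ch.encode("latin-1")))
        else:
            # Assumed: everything else falls back to UTF-8 octets.
            out.append(quote_from_bytes(ch.encode("utf-8")))
    return "".join(out)

def legacy_unquote(s):
    """Hypothetical emulation: always decode the octets as Latin-1."""
    return unquote(s, encoding="latin-1")

print(legacy_quote("é"))                    # %E9
print(legacy_quote("漢"))                   # %E6%BC%A2
print(legacy_unquote(legacy_quote("é")))    # é
print(legacy_unquote(legacy_quote("漢")))   # æ¼¢
```

Under these assumptions the pair round-trips Latin-1 text but not anything outside it, since the quoting side switches encoding and the unquoting side does not.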

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 8:09 AM, André Malo wrote: I'm actually in favour of encoding bytes only back and forth. A useful extension would be *another* function which wraps quote/unquote and encodes and decodes characters. I'd reverse this. By all means, add a new pair of

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
For unquote, I think it will break a lot and surprise everyone. I think that while this may be purely the best option, it's pretty silly. I don't mind being silly to do the right thing. Happens to me a lot :-). Bill

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
On Wed, Jul 30, 2008 at 8:09 AM, André Malo wrote: I'm actually in favour of encoding bytes only back and forth. A useful extension would be *another* function which wraps quote/unquote and encodes and decodes characters. I'd reverse this. By all means, add a new

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 9:52 AM, Bill Janssen wrote: On Wed, Jul 30, 2008 at 8:09 AM, André Malo wrote: I'm actually in favour of encoding bytes only back and forth. A useful extension would be *another* function which wraps quote/unquote and encodes

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
Actually (as I pointed out before) the existing functions are not string-in/string-out. They are something-in and bytes-out. Sorry, this is wrong. quote is clearly bytes-in and string-out. unquote is clearly string-in and bytes-out. The whole point of quote is to take an arbitrary sequence
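
The bytes-in/string-out view of quote, and string-in/bytes-out view of unquote, can be illustrated with the bytes-oriented functions that exist in today's urllib.parse. This is a sketch of that reading only; the functions' final shape postdates this message:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# An arbitrary octet sequence, not valid text in any particular encoding.
payload = bytes([0x00, 0xFF, 0x20]) + b"abc"

quoted = quote_from_bytes(payload)    # bytes in, ASCII str out
restored = unquote_to_bytes(quoted)   # str in, bytes out

print(quoted)               # %00%FF%20abc
print(restored == payload)  # True
```

No character encoding is involved at any step, so the round trip is lossless for any input bytes.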

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
It looks like all other APIs in the Py3k version of urllib treat URLs as text. The URL is text, a string of ASCII characters. We're just talking about urllib.quote() and urllib.unquote(), which are there to support the text-ization of binary values, and the de-text-ization. I think that

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 10:33 AM, Bill Janssen wrote: It looks like all other APIs in the Py3k version of urllib treat URLs as text. The URL is text, a string of ASCII characters. We're just talking about urllib.quote() and urllib.unquote(), which are there to support the

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Jeff Hall
(Aside: I dislike functions that have a different return type based on the value of a parameter.) I wanted to stay out of the whole discussion as it's largely over my head... But I did want to express support for this idea which I think almost rises to the level of a standard... I see more

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
unquote() -- takes string, produces bytes or string If optional encoding parameter is specified, decodes bytes with that encoding and returns string. Otherwise, returns bytes. The default of returning bytes will break almost all uses. Most code will use the unquoted result
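
The proposed signature quoted above could be sketched like this. The name unquote_maybe is hypothetical, and this is an illustration of the proposal under discussion, not the API that was eventually adopted:

```python
from urllib.parse import unquote_to_bytes

def unquote_maybe(s, encoding=None):
    """Hypothetical: return bytes unless an encoding is given."""
    raw = unquote_to_bytes(s)
    if encoding is None:
        return raw               # caller gets the undecoded octets
    return raw.decode(encoding)  # caller asked for text

as_bytes = unquote_maybe("%C3%A9")
as_text = unquote_maybe("%C3%A9", encoding="utf-8")

print(as_bytes)  # b'\xc3\xa9'
print(as_text)   # é
```

The return type depends on whether an argument is supplied, which is the property objected to elsewhere in the thread.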

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 12:49 PM, Bill Janssen wrote: unquote() -- takes string, produces bytes or string If optional encoding parameter is specified, decodes bytes with that encoding and returns string. Otherwise, returns bytes. The default of returning bytes

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Bill Janssen
I think this is as close as consensus as we can get on this issue. Can whoever wrote the patch adjust the patch to this outcome? (I think the only change is to remove the encoding arguments and make separate functions for bytes.) This is 2.7/3.1 only, right? I'm looking at the bales of code

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Matt Giuca
Con: URI encoding does not encode characters. OK, for all the people who say URI encoding does not encode characters: yes it does. This is not an encoding for binary data, it's an encoding for character data, but it's unspecified how the strings map to octets before being percent-encoded. From
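
The point that the string-to-octet mapping is left unspecified can be seen by quoting one character under two encodings. A minimal sketch, assuming the encoding parameter of urllib.parse.quote as it exists in current Python 3:

```python
from urllib.parse import quote

char = "é"

# Same character, different octets, therefore different percent-encodings.
via_utf8 = quote(char, encoding="utf-8")
via_latin1 = quote(char, encoding="latin-1")

print(via_utf8)    # %C3%A9
print(via_latin1)  # %E9
```

Both results are valid percent-encoded strings; nothing in the output itself says which mapping produced it.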

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-30 Thread Guido van Rossum
On Wed, Jul 30, 2008 at 8:49 PM, Matt Giuca wrote: Con: URI encoding does not encode characters. OK, for all the people who say URI encoding does not encode characters: yes it does. This is not an encoding for binary data, it's an encoding for character data, but it's

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-14 Thread Bill Janssen
Clearly the unquote is str->bytes, <snip> You can't pass a Unicode string back as the result of unquote *without* passing in an encoding specifier, because the character set is application-specific. So for unquote you're suggesting that it always return a bytes object UNLESS an encoding is

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-13 Thread André Malo
* Matt Giuca wrote: This POV is way too browser-centric... This is but one example. Note that I found web forms to be the least clear-cut example of choosing an encoding. Most of the time applications seem to be using UTF-8, and all the standards I have read are moving towards specifying

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-13 Thread Bill Janssen
Ah there may be some confusion here. We're only dealing with str->str transformations (which in Python 3 means Unicode strings). You can't put a bytes in or get a bytes out of either of these functions. I suggested a quote_raw and unquote_raw function which would let you do this. Ah, well,

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-13 Thread Matt Giuca
On Mon, Jul 14, 2008 at 4:54 AM, André Malo wrote: Ahem. The HTTP standard does ;-) Really? Can you include a quotation please? The HTTP standard talks a lot about ISO-8859-1 (Latin-1) in terms of actually raw encoded bytes, but not in terms of URI percent-encoding (a

[Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
Hi all, My first post to the list. In fact, first time Python hacker, long-time Python user though. (Melbourne, Australia). Some of you may have seen for the past week or so my bug report on Roundup, http://bugs.python.org/issue3300 I've spent a heap of effort on this patch now so I'd really

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Brett Cannon
On Sat, Jul 12, 2008 at 10:27 AM, Matt Giuca wrote: Hi all, My first post to the list. In fact, first time Python hacker, long-time Python user though. (Melbourne, Australia). Welcome! Some of you may have seen for the past week or so my bug report on Roundup,

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Bill Janssen
Basically, urllib.quote and unquote seem not to have been updated since Python 2.5, and because of this they implicitly perform Latin-1 encoding and decoding (with respect to percent-encoded characters). I think they should default to UTF-8 for a number of reasons, including that's what other

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Jeroen Ruigrok van der Werven
-On [20080712 19:27], Matt Giuca wrote: Basically, urllib.quote and unquote seem not to have been updated since Python 2.5, and because of this they implicitly perform Latin-1 encoding and decoding (with respect to percent-encoded characters). I think they should default to

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Martin v. Löwis
Very nice, I had this somewhere on my todo list to work on. I'm very much in favour, especially since it synchronizes us with the RFCs (for all I remember reading about it last time). I still think that it doesn't. The RFCs haven't changed, and can't change for compatibility reasons. The

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
Thanks for all the replies, and making me feel welcome :) If what you are saying is true, then it can probably go in as a bug fix (unless someone else knows something about Latin-1 on the Net that makes this not true). Well from what I've seen, the only time Latin-1 naturally appears on the

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread André Malo
* Matt Giuca wrote: Well from what I've seen, the only time Latin-1 naturally appears on the net is when you have a web page in Latin-1 (either explicit or inferred; and note that a browser like Firefox will infer Latin-1 if it sees only ASCII characters) with a form in it. Submitting the

Re: [Python-Dev] urllib.quote and unquote - Unicode issues

2008-07-12 Thread Matt Giuca
This POV is way too browser-centric... This is but one example. Note that I found web forms to be the least clear-cut example of choosing an encoding. Most of the time applications seem to be using UTF-8, and all the standards I have read are moving towards specifying UTF-8 (from being