[I've been pretty busy lately, so sorry for jumping in late again]

* Matt Giuca wrote:

> 1. Leave it as it is. quote is Latin-1 if range(0,256), fallback to
> UTF-8. unquote is Latin-1.
> In favour: Anybody who doesn't reply to this thread
> Pros: Already implemented; some existing code depends upon ord values
> of string being the same as they were for byte strings; possible to
> hack around it.
> Cons: unquote is not inverse of quote; quote behaviour
> internally-inconsistent; garbage when unquoting UTF-8-encoded URIs.
>
> 2. Default to UTF-8.
> In favour: Matt Giuca, Brett Cannon, Jeroen Ruigrok van der Werven
> Pros: Fully working and tested solution is implemented; recommended by
> RFC 3986 for all future schemes; recommended by W3C for use with HTML;
> UTF-8 used by all major browsers; supports all characters; most
> existing code compatible by default; unquote is inverse of quote.
> Cons: By default, URIs may have invalid octet sequences (not possible
> to reverse).

Con: URI encoding does not encode characters.

>
> 3. quote default to UTF-8, unquote default to Latin-1.
> In favour: André Malo
> Pros: quote able to handle all characters; unquote able to handle all
> sequences. Cons: unquote is not inverse of quote; totally inconsistent.

I'm not in favour of that. I merely answered a question there ;)
I'm actually in favour of encoding bytes only back and forth. A useful
extension would be *another* function which wraps quote/unquote and
encodes and decodes characters.

> 4. quote accepts either bytes or str, unquote default to outputting
> bytes unless given an encoding argument.
> In favour: Bill Janssen
> Pros: Technically does what the spec says, which is treat it as an
> octet encoding.
> Cons: unquote will break most existing code; almost 100% of the time
> people will want it as a string.
>
> </impartiality>
>
> I'll just comment on #4 since I haven't already. Let's talk about
> quote and unquote separately. For quote, I'm all for letting it accept
> a bytes as well as a str. That doesn't break anything or surprise
> anyone.
>
> For unquote, I think it will break a lot and surprise everyone. I
> think that while this may be "purely" the best option, it's pretty
> silly. I reckon the vast majority of users will be surprised when they
> see it spitting out a bytes object, and all that most people will do
> is decode it as UTF-8. Besides, while you're reading the RFCs as "URLs
> specify a method for encoding octet sequences", I'm reading them as
> "URLs specify a method for encoding strings, and leave the character
> encoding unspecified." The second reading supports the idea that
> unquote outputs a str.
>
> I'm also recommending we add unquote_to_bytes to do what you suggest
> unquote should do. (So either way we'll get both versions of unquote;
> I'm just suggesting the one called "unquote" do the thing everybody
> expects). But that's less of a priority so I want to commit these
> urgent fixes first.
>
> I'm basically saying just two things: 1. The standards are undefined;

That's still disputed...

> 2. Therefore we should pick the most useful and/or intuitive default.
> IMHO choosing UTF-8 *is* the most useful AND intuitive, and will be
> more so in the future when more technologies are hard-coded as UTF-8
> (which this RFC recommends they do in the future).

See my suggestion above.

nd
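
To make the suggestion above concrete, here is a rough sketch of a bytes-only quote/unquote core with a separate pair of functions that handle characters. It assumes present-day Python 3, where urllib.parse provides quote_from_bytes and unquote_to_bytes; the wrapper names quote_chars and unquote_chars are hypothetical and not part of any proposal in this thread.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def quote_chars(text, encoding="utf-8", safe="/"):
    """Hypothetical wrapper: encode characters to octets, then percent-encode."""
    return quote_from_bytes(text.encode(encoding), safe=safe)


def unquote_chars(quoted, encoding="utf-8", errors="strict"):
    """Hypothetical wrapper: percent-decode to octets, then decode characters."""
    return unquote_to_bytes(quoted).decode(encoding, errors)


# The octet layer round-trips without knowing anything about characters.
octets = unquote_to_bytes("caf%C3%A9")
assert quote_from_bytes(octets) == "caf%C3%A9"

# The character layer is an inverse pair only when both sides agree on
# the encoding (the objection to option 3).
assert unquote_chars(quote_chars("café")) == "café"

# Decoding UTF-8 octets as Latin-1 yields mojibake (the "garbage" from option 1).
garbled = unquote_chars("caf%C3%A9", encoding="latin-1")
```

With this split, the caller chooses the character encoding explicitly at the wrapper level, and the octet-level functions never have to guess one.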