Thanks for all the replies, and making me feel welcome :)

> If what you are saying is true, then it can probably go in as a bug
> fix (unless someone else knows something about Latin-1 on the Net that
> makes this not true).

Well, from what I've seen, the only time Latin-1 naturally appears on the net is when you have a web page in Latin-1 (either explicit or inferred; note that a browser like Firefox will infer Latin-1 if it sees only ASCII characters) with a form in it. When the form is submitted, the browser uses Latin-1 to percent-encode the query string.

So if you write a web app and you don't have any non-ASCII characters or mention the charset, chances are you'll get Latin-1. But I would argue you're leaving things to chance and you deserve to get funny behaviour. If you do any of the following:

- Use a non-ASCII character, encoded as UTF-8 on the page.
- Send a Content-Type: xxxx; charset=utf-8.
- In HTML, set a <meta http-equiv="Content-Type" content="xxxx; charset=utf-8" />.
- In the form itself, set <form accept-charset="utf-8">.

then the browser will encode the form data as UTF-8. And most "proper" web pages should get themselves explicitly served as UTF-8.

> That I can't say I can necessarily do; have my own bug reports to
> work through this weekend. =)

OK, well, I'm busy for the next few days; after that I can do a patch trade with someone. (That is, if I am allowed to do reviews; not sure, since I don't have developer privileges.)

On Sun, Jul 13, 2008 at 5:58 AM, Mark Hammond <[EMAIL PROTECTED]> wrote:
> > My first post to the list. In fact, first time Python hacker,
> > long-time Python user though. (Melbourne, Australia).
>
> Cool - where exactly? I'm in Wantirna (although not at this very moment -
> I'm in Lithuania, but home again in a couple of days)

Cool :) Balwyn.

> * Please take Martin with a grain of salt (I would say "ignore him", but
> that is too strong ;)

Lol, he is a hard man to please, but he's given some good feedback.
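
To make the form-encoding point above concrete, here is a minimal sketch of the two query strings a browser could produce for the same field value. It assumes a urllib.parse.quote() that accepts an encoding argument (as Python 3's does today); the value "café" is only an illustrative example, not something taken from a real form.

```python
from urllib.parse import quote

# Hypothetical form field value containing one non-ASCII character.
value = "café"

# Page treated as Latin-1: 'é' is the single byte 0xE9.
latin1_query = quote(value, encoding="latin-1")

# Page served as UTF-8: 'é' is the two bytes 0xC3 0xA9.
utf8_query = quote(value, encoding="utf-8")

print(latin1_query)  # caf%E9
print(utf8_query)    # caf%C3%A9
```

The server sees only the percent-escapes, so it cannot tell which of the two it received unless it already knows the page's charset.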

On Sun, Jul 13, 2008 at 7:07 AM, Bill Janssen <[EMAIL PROTECTED]> wrote:
> The standard here is RFC 3986, from Jan 2005, which says,
>
> ``When a new URI scheme defines a component that represents textual
> data consisting of characters from the Universal Character Set [UCS],
> the data should first be encoded as octets according to the UTF-8
> character encoding [STD63]; then only those octets that do not
> correspond to characters in the unreserved set should be
> percent-encoded.''

Ah yes, I was originally hung up on the idea that "URLs had to be encoded in UTF-8", till Martin pointed out that it only says "new URI scheme" there. It's perfectly valid to have non-UTF-8-encoded URIs. However in practice they're almost always UTF-8. So I think introducing the new encoding argument and having it default to "utf-8" is quite reasonable.

> I'd say, treat the incoming data as either Unicode (if it's a Unicode
> string), or some unknown superset of ASCII (which includes both
> Latin-1 and UTF-8) if it's a byte-string (and thus in some unknown
> encoding), and apply the appropriate transformation.

Ah there may be some confusion here. We're only dealing with str->str transformations (which in Python 3 means Unicode strings). You can't put a bytes in or get a bytes out of either of these functions. I suggested a "quote_raw" and "unquote_raw" function which would let you do this.

The issue is with the percent-encoded characters in the URI string, which must be interpreted as bytes, not code points. How then do you convert these into a Unicode string? (Python 2 did not have this problem, since you simply output a byte string without caring about the encoding).

On Sun, Jul 13, 2008 at 9:10 AM, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > Very nice, I had this somewhere on my todo list to work on. I'm very much
> > in favour, especially since it synchronizes us with the RFCs (for all I
> > remember reading about it last time).
>
> I still think that it doesn't.
> The RFCs haven't changed, and can't
> change for compatibility reasons. The encoding of non-ASCII characters
> in URLs remains as underspecified as it always was.

Correct. But my patch brings us in line with that underspecification. The unpatched version forces you to use Latin-1. My patch lets you specify the encoding to use.

> Now, with IRIs, the situation is different, but I don't think the patch
> claims to implement IRIs (and if so, it perhaps shouldn't change URL
> processing in doing so).

True. I don't claim to have implemented IRIs, or even to know enough about them to do that. I'll read up on these things in the next few days. However, this is a URI library, not an IRI library. From what I've seen, it's percent-encoded URIs coming in from the browser, not IRIs. We just need to make sure with this patch that IRIs don't become less supported than they were before; we don't need to explicitly support them.

Cheers,
Matt Giuca
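
For reference, a rough sketch of the decoding problem discussed in this message: percent-escapes stand for bytes, so turning them back into a str needs an encoding. It assumes Python 3's current urllib.parse, where unquote() takes encoding and errors arguments and unquote_to_bytes() returns the raw bytes. Those names come from today's library, not from the "quote_raw"/"unquote_raw" suggestion, and the sample strings are illustrative only.

```python
from urllib.parse import unquote, unquote_to_bytes

# The same text, "café", percent-encoded under two different encodings.
utf8_encoded = "caf%C3%A9"
latin1_encoded = "caf%E9"

# Percent-escapes denote bytes; no encoding is involved at this step.
raw = unquote_to_bytes(latin1_encoded)

# Getting a str back requires choosing an encoding.
from_utf8 = unquote(utf8_encoded)  # default encoding is utf-8
from_latin1 = unquote(latin1_encoded, encoding="latin-1")

# With the default errors="replace", decoding Latin-1 bytes as UTF-8
# does not raise; the invalid byte becomes U+FFFD.
wrong_guess = unquote(latin1_encoded)

print(raw)                       # b'caf\xe9'
print(from_utf8 == from_latin1)  # True
print(ascii(wrong_guess))        # 'caf\ufffd'
```

A wrong guess fails quietly here, which is why having the caller state the encoding, instead of the library hard-coding one, matters.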
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com