#19468: django doesn't encode request.path correctly in python3
---------------------------------+------------------------------------
     Reporter:  aliva            |                    Owner:  nobody
         Type:  Bug              |                   Status:  new
    Component:  Python 3         |                  Version:  master
     Severity:  Release blocker  |               Resolution:
     Keywords:                   |             Triage Stage:  Accepted
    Has patch:  1                |      Needs documentation:  0
  Needs tests:  0                |  Patch needs improvement:  0
Easy pickings:  0                |                    UI/UX:  0
---------------------------------+------------------------------------

Comment (by aaugustin):

Replying to [comment:7 claudep]:
> I admit that due to an unfortunate missing standard in the past, URL encoded with non-utf-8 encodings are possible and correct RFC-wise. However, all modern browsers do encode the URLs with UTF-8, and that has nothing to do with "nicely displaying" them.

This sentence is simplifying things a bit, because it ignores URL-encoding. Here's what browsers really do.

'''1) When you type non-ASCII characters in a URL bar''', say `http://example.com/café/`, the browser will utf-8-encode and then URL-encode it, resulting in `http://example.com/caf%C3%A9/`.

To try it yourself, run `nc -l 8000` in a console, and go to `http://localhost:8000/café/` with a browser. In the console you'll see:
{{{
GET /caf%C3%A9/ HTTP/1.1
Host: localhost:8000
...
}}}

'''2) When a URL contains non-ASCII characters (which is illegal: URLs must be ASCII-only)''', browsers cope with the situation as above.

I tested this in Firefox, Chrome and Safari by creating a file with the following content, and saving it with the iso-8859-1 encoding:
{{{
<html>
<head><meta charset="iso-8859-1"><title>Test iso-8859-1 link</title></head>
<body><a href="http://localhost:8000/café/">Café!</a></body>
</html>
}}}

Clicking the link gives the same result as above in the console (which is a bit surprising, since it would make sense to keep the original charset here).

'''3) When a URL is properly URL-encoded''', browsers transmit it '''as is'''. The server can then URL-decode it and interpret it according to whatever charset it wants.

I did the same test, but with a URL-encoded URL:
{{{
<html>
<head><meta charset="iso-8859-1"><title>Test iso-8859-1 link</title></head>
<body><a href="http://localhost:8000/caf%e9/">Café!</a></body>
</html>
}}}

When clicking the link, the browser sends the original URL (iso-8859-1 encoded, URL-encoded):
{{{
GET /caf%e9/ HTTP/1.1
Host: localhost:8000
}}}

----

> They really send utf-8-encoded paths on the wire.

No, as demonstrated above.

Browsers are notoriously robust to ill-formed inputs. A non-ASCII (i.e. non URL-encoded) URL is an invalid input. Rather than reject it, browsers choose to encode it using utf-8, URL-encode the result, and use that. It's a good choice for error handling; being 99% correct is good enough when you're dealing with invalid content in the first place.

But if a developer wants to write a Django server with Shift-JIS URLs (which may be more compact than utf-8 for Asian languages), he's allowed to. If someone wants to replace a legacy system that uses ISO-8859-1 URLs with a Django version, she can. The URLs may not display nicely in browsers, but as long as they're properly URL-encoded, they'll work.

Besides, browsers aren't the only consumers of HTTP content on the Internet.

----

> If wsgiref/PEP 3333 chooses to continue to "wrongly" (but safely) decoding 98% of URLs, it might be a design choice and it remains to be seen if it is a problem or not. Backwards compatibility is also an issue here.

I still think PEP 3333's behavior is right (given the decision to use native strings in `environ`).

If PEP 3333 decided to URL-decode and utf-8-decode URLs, it would prevent people from using any charset other than utf-8 in their URLs. I've given use cases for non-utf-8 URLs above.

----

> As far as Django is concerned, I do agree with your next steps. But I'm -1 to using DEFAULT_CHARSET for decoding URLs. Django has absolutely no influence on the encoding of URL paths, that's the user agent's business. So even when you decide you want to serve non UTF-8 responses by setting DEFAULT_CHARSET, you still have no influence on the encoding of the paths you are receiving from clients

I disagree. You have total control over the encoding of the paths you are receiving, ''as long as you <charset>-encode and URL-encode your URLs'', like you should. The UA '''must not''' perform any decoding or encoding on properly URL-encoded URLs.
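
To make the round trip concrete, here is a minimal Python 3 sketch of "<charset>-encode, then URL-encode" on the way out, and of recovering the path from a PEP 3333 native string on the way in. The helper names (`encode_path`, `wsgi_path_info`, `decode_wsgi_path`) are hypothetical and this is not Django's implementation; it only assumes that the WSGI server URL-decodes the path and exposes the resulting bytes as an ISO-8859-1-decoded `str`, as PEP 3333 specifies.

```python
from urllib.parse import quote, unquote_to_bytes


def encode_path(path: str, charset: str) -> str:
    """<charset>-encode, then URL-encode a path, as a site should do in its links."""
    return quote(path, encoding=charset)


def wsgi_path_info(url_path: str) -> str:
    """Simulate what a PEP 3333 server stores in environ['PATH_INFO'].

    The server URL-decodes the request path to bytes, then decodes those
    bytes as ISO-8859-1 to obtain a native string.
    """
    return unquote_to_bytes(url_path).decode("iso-8859-1")


def decode_wsgi_path(path_info: str, charset: str) -> str:
    """Recover the text path: undo the ISO-8859-1 step, then apply the real charset."""
    return path_info.encode("iso-8859-1").decode(charset)


if __name__ == "__main__":
    for charset in ("utf-8", "iso-8859-1"):
        url = encode_path("/café/", charset)
        path = decode_wsgi_path(wsgi_path_info(url), charset)
        print(charset, url, path)
```

With utf-8 the URL is `/caf%C3%A9/`, with iso-8859-1 it is `/caf%E9/`, and both decode back to `/café/` provided the server side uses the same charset that produced the link. The ISO-8859-1 step loses nothing because every byte value maps to exactly one code point.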

> (also taking into account hand-written URLs in browser address bars).

In my opinion, these are orthogonal issues. Yes, hand-written URLs are the only case where Django doesn't have control.

Obviously, most regular websites will just use utf-8 everywhere, and that guarantees the best compatibility with the Web ecosystem. My point is to make it ''possible'' to use something else if one wants to and is aware of the consequences. That's why Django has a `DEFAULT_CHARSET` setting.

----

If you think that Django should give up all pretense to support non-utf-8 environments, that's another discussion!

--
Ticket URL: <https://code.djangoproject.com/ticket/19468#comment:8>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

--
You received this message because you are subscribed to the Google Groups "Django updates" group.
To post to this group, send email to django-updates@googlegroups.com.
To unsubscribe from this group, send email to django-updates+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.