#19468: django doesn't encode request.path correctly in python3
---------------------------------+------------------------------------
     Reporter:  aliva            |                    Owner:  nobody
         Type:  Bug              |                   Status:  new
    Component:  Python 3         |                  Version:  master
     Severity:  Release blocker  |               Resolution:
     Keywords:                   |             Triage Stage:  Accepted
    Has patch:  1                |      Needs documentation:  0
  Needs tests:  0                |  Patch needs improvement:  0
Easy pickings:  0                |                    UI/UX:  0
---------------------------------+------------------------------------

Comment (by aaugustin):

 Replying to [comment:7 claudep]:
 > I admit that due to an unfortunate missing standard in the past, URL
 encoded with non-utf-8 encodings are possible and correct RFC-wise.
 However, all modern browsers do encode the URLs with UTF-8, and that has
 nothing to do with "nicely displaying" them.

 This sentence is simplifying things a bit, because it ignores URL-
 encoding.

 Here's what browsers really do.

 '''1) When you type non-ASCII characters in a URL bar''', say
 `http://example.com/café/`, the browser will utf-8-encode and then url-
 encode it, resulting in `http://example.com/caf%C3%A9/`.

 To try it yourself, run `nc -l 8000` in a console, and go to
 `http://localhost:8000/café/` with a browser. In the console you'll see:
 {{{
 GET /caf%C3%A9/ HTTP/1.1
 Host: localhost:8000
 ...
 }}}
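
 The two steps (utf-8-encode, then URL-encode) can be reproduced with the
 Python 3 standard library. This is only an illustrative sketch; the
 variable names are made up for the example and it is not what any browser
 actually runs.

```python
from urllib.parse import quote

path = "/café/"

# Step 1: encode the text to bytes using utf-8.
utf8_bytes = path.encode("utf-8")        # b'/caf\xc3\xa9/'

# Step 2: URL-encode (percent-encode) every non-ASCII byte.
request_path = quote(utf8_bytes)         # '/' is left alone by default

print(request_path)                      # /caf%C3%A9/
```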

 '''2) When a URL contains non-ASCII characters (which is illegal: URLs
 must be ASCII-only)''', browsers cope with the situation as above.

 I tested this in Firefox, Chrome and Safari by creating a file with the
 following content, and saving it with the iso-8859-1 encoding:
 {{{
 <html>
 <head><meta charset="iso-8859-1"><title>Test iso-8859-1 link</title></head>
 <body><a href="http://localhost:8000/café/">Café!</a></body>
 </html>
 }}}

 Clicking the link gives the same result as above in the console (which is
 a bit surprising — it would make sense to keep the original charset here).

 '''3) When a URL is properly URL-encoded''', browsers transmit it '''as
 is'''. The server can then URL-decode it and interpret it according to
 whatever charset it wants.

 I did the same test, but with a URL-encoded URL:

 {{{
 <html>
 <head><meta charset="iso-8859-1"><title>Test iso-8859-1 link</title></head>
 <body><a href="http://localhost:8000/caf%e9/">Café!</a></body>
 </html>
 }}}

 When clicking the link, the browser sends the original URL
 (iso-8859-1 encoded, URL-encoded):
 {{{
 GET /caf%e9/ HTTP/1.1
 Host: localhost:8000
 }}}
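
 On the server side, "interpret it according to whatever charset it wants"
 could look like the following sketch, which uses only `urllib.parse` from
 the Python 3 standard library. The variable names are hypothetical.

```python
from urllib.parse import unquote, unquote_to_bytes

wire_path = "/caf%e9/"

# URL-decoding alone yields bytes; no charset is involved yet.
raw = unquote_to_bytes(wire_path)                        # b'/caf\xe9/'

# The server decides which charset those bytes are in.
as_latin1 = unquote(wire_path, encoding="iso-8859-1")    # '/café/'

# Guessing utf-8 here is wrong: 0xE9 followed by '/' is not valid utf-8,
# so the byte is replaced by U+FFFD.
as_utf8 = unquote(wire_path, encoding="utf-8", errors="replace")

print(raw, as_latin1, ascii(as_utf8))
```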

 ----

 >  They really send utf-8-encoded paths on the wire.

 No, as demonstrated above.

 Browsers are notoriously robust to ill-formed inputs. A non-ASCII (i.e. non
 URL-encoded) URL is an invalid input.

 Rather than reject it, browsers choose to encode it using utf-8, URL-
 encode the result, and use that. It's a good choice for error handling;
 being 99% correct is good enough when you're dealing with invalid content
 in the first place.

 But if a developer wants to write a Django server with Shift-JIS URLs (it
 may be more compact than utf-8 for Asian languages), he's allowed to. If
 someone wants to replace a legacy system with ISO-8859-1 URLs with a Django
 version, she can. The URLs may not display nicely in browsers, but as long
 as they're properly URL-encoded, they'll work.

 Besides, browsers aren't the only consumers of HTTP content on the
 Internet.

 ----

 > If wsgiref/PEP 3333 chooses to continue to "wrongly" (but safely)
 decoding 98% of URLs, it might be a design choice and it remains to be
 seen if it is a problem or not. Backwards compatibility is also an issue
 here.

 I still think it's right (given the decision to use native strings in
 `environ`). If PEP 3333 decided to URL-decode and utf-8-decode URLs, it
 would prevent people from using any charset other than utf-8 in their
 URLs. I've given use cases for non-utf-8 URLs above.
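
 For context: under PEP 3333 on Python 3, the server URL-decodes the path
 and exposes the resulting bytes in `environ` as a native string decoded
 with ISO-8859-1, so the original bytes are always recoverable. A minimal
 sketch of how an application could then apply the charset of its choice
 follows. `decode_path_info` is a hypothetical helper written for this
 example, not Django's implementation.

```python
def decode_path_info(path_info: str, charset: str = "utf-8") -> str:
    """Re-decode a PEP 3333 native string with the charset the app expects."""
    # ISO-8859-1 maps bytes 0-255 to code points 0-255, so this round trip
    # gives back exactly the bytes the server received.
    raw = path_info.encode("iso-8859-1")
    return raw.decode(charset)


# Simulate what a WSGI server would put in environ["PATH_INFO"].
from_utf8_site = "/café/".encode("utf-8").decode("iso-8859-1")        # mojibake
from_latin1_site = "/café/".encode("iso-8859-1").decode("iso-8859-1")

print(decode_path_info(from_utf8_site, "utf-8"))
print(decode_path_info(from_latin1_site, "iso-8859-1"))
```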

 ----

 > As far as Django is concerned, I do agree with your next steps. But I'm
 -1 to using DEFAULT_CHARSET for decoding URLs. Django has absolutely no
 influence on the encoding of URL paths, that's the user agent's business.
 So even when you decide you want to serve non UTF-8 responses by setting
 DEFAULT_CHARSET, you still have no influence on the encoding of the paths
 you are receiving from clients

 I disagree. You have total control over the encoding of the paths you are
 receiving, ''as long as you <charset>-encode and URL-encode your URLs'',
 like you should. The UA '''must not''' perform any decoding or encoding on
 properly URL-encoded URLs.
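
 To make "<charset>-encode and URL-encode" concrete, here is a small sketch
 of a site that emits Shift-JIS URLs and decodes them again on the way in.
 `build_path` and `read_path` are hypothetical helpers for this example;
 only `quote` and `unquote` come from the standard library.

```python
from urllib.parse import quote, unquote

SITE_CHARSET = "shift_jis"  # assumption: the charset this site chose for URLs


def build_path(text: str, charset: str = SITE_CHARSET) -> str:
    """Charset-encode, then URL-encode, a path for use in a link."""
    return quote(text, safe="/", encoding=charset)


def read_path(wire_path: str, charset: str = SITE_CHARSET) -> str:
    """URL-decode a received path and interpret it in the site charset."""
    return unquote(wire_path, encoding=charset)


link = build_path("/日本語/")

# The emitted link is pure ASCII, so a user agent has nothing to re-encode.
print(link.isascii())
print(read_path(link) == "/日本語/")
```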

 > (also taking into account hand-written URLs in browser address bars). In
 my opinion, these are orthogonal issues.

 Yes, hand-written URLs are the only case where Django doesn't have
 control.

 Obviously, most regular websites will just use utf-8 everywhere, and that
 guarantees the best compatibility with the Web ecosystem.

 My point is to make it ''possible'' to use something else if one wants to
 and is aware of the consequences. That's why Django has a
 `DEFAULT_CHARSET` setting.

 ----

 If you think that Django should give up all pretense to support non-utf-8
 environments, that's another discussion!

-- 
Ticket URL: <https://code.djangoproject.com/ticket/19468#comment:8>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
