#19468: django doesn't encode request.path correctly in python3
---------------------------------+------------------------------------
     Reporter:  aliva            |                    Owner:  nobody
         Type:  Bug              |                   Status:  new
    Component:  Python 3         |                  Version:  master
     Severity:  Release blocker  |               Resolution:
     Keywords:                   |             Triage Stage:  Accepted
    Has patch:  1                |      Needs documentation:  0
  Needs tests:  0                |  Patch needs improvement:  0
Easy pickings:  0                |                    UI/UX:  0
---------------------------------+------------------------------------

Comment (by aaugustin):

 To the best of my understanding, Graham's answer on Python's tracker is
 correct, and there's no bug in Python.

 PEP 3333 says that `environ` must contain native strings (`str` objects).
 When native strings are actually implemented with a unicode-aware type,
 only code points representable in ISO-8859-1 encoding may be used.

 One might disagree with the idea of using native strings for storing data
 that's really bytes, but it also has advantages, and it's the status quo.
 The point of PEP 3333 is to provide a stable API; it seems extremely
 unlikely to me that it'll change for years.

 ----

 Per RFC 3986 2.5:
 > When a new URI scheme defines a component that represents textual data
 consisting of characters from the Universal Character Set [UCS], the data
 should first be encoded as octets according to the UTF-8 character
 encoding [STD63]
 but HTTP is an "old" URI scheme per RFC 3987 6.4:
 > the HTTP URL scheme does not specify how to encode original characters.
 and just below there's an example of a non-UTF-8 HTTP URL.

 Yes, modern browsers will '''nicely display''' utf-8 URLs, but that's just
 cosmetic. You can write a perfectly correct and RFC-compliant HTTP service
 that uses another charset in its URLs.

 WSGI uses `iso-8859-1` because every bytestring can be decoded with this
 charset. If it assumed `utf-8`, it would fail to decode some perfectly
 valid HTTP requests. WSGI wants to be universal and can't make 99%-correct
 assumptions.
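
 A minimal sketch of that point, assuming Python 3 and only the standard
 library. The sample path and the `decodes_as_utf8` helper are made up for
 illustration. Every byte value decodes under `iso-8859-1` and round-trips,
 whereas `utf-8` rejects some byte sequences:

```python
# Every byte value 0x00-0xFF maps to exactly one code point in ISO-8859-1,
# so decoding never fails and re-encoding gives back the original bytes.
all_bytes = bytes(range(256))
as_native = all_bytes.decode('iso-8859-1')
assert as_native.encode('iso-8859-1') == all_bytes

# Hypothetical raw path that uses ISO-8859-1 for "é" (byte 0xE9).
# It is a legal HTTP path, but it is not valid UTF-8.
raw_path = b'/caf\xe9'


def decodes_as_utf8(data):
    """Return True if the byte string is valid UTF-8 (illustrative helper)."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True
```

 Here `decodes_as_utf8(raw_path)` is false, while
 `raw_path.decode('iso-8859-1')` succeeds without needing to know which
 charset the client actually intended.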

 ----

 So, this has three practical consequences for us:
 - every HTTP request can be unambiguously represented in WSGI, and the
 WSGI layer need not be aware of the encoding of the URL (or of the rest
 of the HTTP request);
 - Django can recover the original bytestring of any `environ` value,
 including `environ['PATH_INFO']`, with `.encode('iso-8859-1')`;
 - Django must re-decode data fetched from `environ` with the appropriate
 charset.

 The next steps are:
 - audit where Django is reading data from `environ`;
 - determine which charset should be used for decoding.

 I find it reasonable to assume that URLs will use the same charset as HTTP
 responses; that means using
 `.encode('iso-8859-1').decode(settings.DEFAULT_CHARSET)`.
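
 A rough sketch of that round trip, not a proposed patch. The `redecode`
 helper is hypothetical, and its `charset` argument defaults to `'utf-8'`
 only as a stand-in for `settings.DEFAULT_CHARSET`, so that the snippet
 runs without a configured Django project. The `environ` dict simulates
 what a PEP 3333 server would hand over:

```python
def redecode(environ_value, charset='utf-8'):
    """Recover the raw bytes behind a WSGI native string, then decode them.

    Hypothetical helper; `charset` stands in for settings.DEFAULT_CHARSET.
    """
    return environ_value.encode('iso-8859-1').decode(charset)


# Bytes on the wire for the path "/café", encoded by the client as UTF-8.
raw = '/café'.encode('utf-8')

# A PEP 3333 server exposes those bytes as an ISO-8859-1 decoded str.
environ = {'PATH_INFO': raw.decode('iso-8859-1')}

path = redecode(environ['PATH_INFO'])
```

 With these inputs `environ['PATH_INFO']` holds the mojibake form of the
 path and `path` holds the intended text. The sketch does no error
 handling: bytes that are invalid in `charset` would raise
 `UnicodeDecodeError`, which real code would have to deal with.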

-- 
Ticket URL: <https://code.djangoproject.com/ticket/19468#comment:4>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
