#19468: django doesn't encode request.path correctly in python3
---------------------------------+------------------------------------
Reporter: aliva | Owner: nobody
Type: Bug | Status: new
Component: Python 3 | Version: master
Severity: Release blocker | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------------+------------------------------------
Comment (by aaugustin):
To the best of my understanding, Graham's answer on Python's tracker is
correct, and there's no bug in Python.
PEP 3333 says that `environ` must contain native strings (`str` objects).
When native strings are actually implemented with a unicode-aware type,
only code points representable in ISO-8859-1 encoding may be used.
One might disagree with the idea of using native strings for storing data
that's really bytes, but it also has advantages and it's the status quo.
The point of PEP 3333 is to provide a stable API; it seems extremely
unlikely to me that it'll change for years.
----
Per RFC 3986 2.5:
> When a new URI scheme defines a component that represents textual data
consisting of characters from the Universal Character Set [UCS], the data
should first be encoded as octets according to the UTF-8 character
encoding [STD63]
but HTTP is an "old" URI scheme per RFC 3987 6.4:
> the HTTP URL scheme does not specify how to encode original characters.
and just below there's an example of a non-UTF-8 HTTP URL.
Yes, modern browsers will '''nicely display''' utf-8 URLs, but that's just
cosmetic. You can write a perfectly correct and RFC-compliant HTTP service
that uses another charset in its URLs.
WSGI uses `iso-8859-1` because every bytestring can be decoded with this
charset. If it assumed `utf-8`, it would fail to decode some perfectly
valid HTTP requests. WSGI wants to be universal and can't make 99%-correct
assumptions.
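To illustrate the point about universality, here is a minimal sketch (plain Python 3, no Django or WSGI server involved; the variable names are only for illustration) showing that every byte value survives an `iso-8859-1` round trip, while `utf-8` rejects some byte sequences:

```python
# All 256 possible byte values, including ones that are invalid as UTF-8.
all_bytes = bytes(range(256))

# ISO-8859-1 maps each byte to the code point with the same number,
# so decoding never fails and encoding restores the original bytes.
as_text = all_bytes.decode('iso-8859-1')
round_trips = as_text.encode('iso-8859-1') == all_bytes

# UTF-8 rejects some sequences, e.g. a trailing 0xE9 (which is "é" in
# ISO-8859-1 but an incomplete multi-byte sequence in UTF-8).
try:
    b'/caf\xe9'.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```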
----
So, this has three practical consequences for us:
- every HTTP request can be unambiguously represented in WSGI, and the
WSGI layer need not be aware of the encoding of the URL (and of the rest
of the HTTP request);
- Django can recover the original bytestring of any `environ` value,
including `environ['PATH_INFO']` with `.encode('iso-8859-1')`;
- Django must re-decode data fetched from `environ` with the appropriate
charset.
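As a rough sketch of the second point, assuming a hand-built `environ` dict rather than a real WSGI server, and a hypothetical helper named `path_info_bytes` (not an existing Django function), the original bytes can be recovered like this:

```python
def path_info_bytes(environ):
    """Return the bytes the client sent for PATH_INFO.

    Hypothetical helper: it assumes the server filled environ with
    ISO-8859-1-decoded native strings, as PEP 3333 requires.
    """
    return environ['PATH_INFO'].encode('iso-8859-1')


# Simulate a client that sent "/café" encoded as UTF-8, and a server
# that decoded those bytes as ISO-8859-1 before storing them in environ.
raw = '/caf\u00e9'.encode('utf-8')
environ = {'PATH_INFO': raw.decode('iso-8859-1')}
```

Here `environ['PATH_INFO']` holds five characters of mojibake, but `path_info_bytes(environ)` returns exactly the bytes in `raw`.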
The next steps are:
- audit where Django is reading data from `environ`;
- determine which charset should be used for decoding.
I find it reasonable to assume that URLs will use the same charset as HTTP
responses; that means using
`.encode('iso-8859-1').decode(settings.DEFAULT_CHARSET)`.
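A minimal sketch of that re-decoding step follows. The module-level `DEFAULT_CHARSET` is a stand-in for `settings.DEFAULT_CHARSET` so the snippet runs without Django, and `redecode` is a hypothetical name, not an existing Django function:

```python
# Stand-in for settings.DEFAULT_CHARSET (an assumption for this sketch).
DEFAULT_CHARSET = 'utf-8'


def redecode(value, charset=DEFAULT_CHARSET):
    """Re-decode a WSGI native string with the given charset.

    The value is first turned back into the original bytes via
    ISO-8859-1, then decoded with the charset the site actually uses.
    """
    return value.encode('iso-8859-1').decode(charset)


# "/café" sent as UTF-8 arrives in environ as the mojibake "/cafÃ©".
fixed = redecode('/caf\u00c3\u00a9')
```

If the bytes are not valid in the chosen charset, `.decode()` raises `UnicodeDecodeError`; how Django should handle that case is not settled by this comment.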
--
Ticket URL: <https://code.djangoproject.com/ticket/19468#comment:4>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.