I wrote:
> PATH_INFO and QUERY_STRING are ... decoded via a configurable
> charset, defaulting to UTF-8. If the path cannot be decoded
> with that charset, ISO-8859-1 is tried. Whichever is successful
> is stored at environ['REQUEST_URI_ENCODING'] so middleware and
> apps can transcode if needed.

and Ian replied:
> My understanding is that PATH_INFO *should* be UTF-8 regardless of
> what encoding a page might be in.  At least that's what I got when
> testing Firefox.  It might not be valid UTF-8 if it was manually
> constructed, but then there's little reason to think it is valid...

Actually, current browsers tend to use UTF-8 for the path, and either the 
encoding of the document [1] or Windows-1252 [2] for the querystring. But the 
vast majority of HTTP user agents are not browsers [3]. Even if that were not 
so, we should not define WSGI to only interoperate with the most current 
browsers.

and Graham added:
> Thinking about it for a while, I get the feel that having a fallback
> to latin-1 if string cannot be decoded as UTF-8 is a bad idea. That
> URLs wouldn't consistently use the same encoding all the time just
> seems wrong. I would see it as returning a bad request status. If an
> application coder knows they are actually going to be dealing with
> latin-1, as that is how the application is written, then they should
> be specifying it should be latin-1 always instead of utf-8. Thus, the
> WSGI adapter should provide a means to override what encoding is used.

Applications do produce URI's (and IRI's, etc. that need to be converted into 
URI's) and do transfer them in media types like HTML, which define how to 
encode a.href's and form.action's before %-encoding them [4]. But these are not 
the only vectors by which clients obtain or generate Request-URI's.

> For simple WSGI adapters which only service one WSGI application, then
> it would apply to the whole URL namespace.

As someone (Alan Kennedy?) noted at PyCon, static resources may depend upon a 
filename encoding defined by the OS which is different than that of the rest of 
the URI's generated/understood by even the most coherent application.

The encoding used for a URI is only really important for one reason: URI 
comparison. Comparison is at the heart of handler dispatch, static resource 
identification, and proper HTTP cache operation. It is for these reasons that 
RFC 3986 has an extensive section on the matter [5], including a "ladder" of 
approaches:

 * Simple String Comparison
 * Case Normalization (e.g. /a%3D == /a%3d)
 * Percent-Encoding Normalization (e.g. /a%62c == /abc)
 * Path Segment Normalization (e.g. /abc/../def == /def)
 * Scheme-Based Normalization (e.g. http://example.com == 
http://example.com:80/)
 * Protocol-Based Normalization (e.g. /data == /data/ if previous dereferencing 
showed it to be)
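The middle rungs of that ladder can be sketched in a few lines. This is only an illustration, not any particular server's implementation; `normalize_path` and `UNRESERVED` are names of my own invention, and the sketch deliberately handles only the path component:

```python
import posixpath
import re

# RFC 3986 "unreserved" characters: safe to decode unconditionally.
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz"
              "0123456789-._~")

def _normalize_escape(match):
    octet = int(match.group(1), 16)
    char = chr(octet)
    # Percent-encoding normalization: decode only unreserved octets;
    # anything else keeps its escape, uppercased (case normalization).
    return char if char in UNRESERVED else "%%%02X" % octet

def normalize_path(path):
    path = re.sub(r"%([0-9A-Fa-f]{2})", _normalize_escape, path)
    # Path segment normalization: collapse "." and ".." segments.
    return posixpath.normpath(path)

assert normalize_path("/a%3D") == normalize_path("/a%3d")       # case
assert normalize_path("/a%62c") == "/abc"                       # percent
assert normalize_path("/abc/../def") == "/def"                  # segments
```

Note that reserved octets like %2F must be left encoded, or segment boundaries would silently change — which is exactly why this belongs in the server, done once and consistently.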

I think it would be beneficial to those who develop WSGI application interfaces 
to be able to assume that at least case-, percent-, path-, and 
scheme-normalization are consistently performed on SCRIPT_NAME and PATH_INFO by 
all WSGI 2 origin servers.

All of those except for the first one can be accomplished without decoding the 
target URI. But that first section specifically states: "In practical terms, 
character-by-character comparisons should be done codepoint-by-codepoint after 
conversion to a common character encoding." In other words, the URI spec seems 
to imply that the two URI's "/a%c3%bf" and "/a%ff" may be equivalent, if the 
former is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF" encoded in 
ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ 
values must be byte strings. IMO WSGI 2 should do better in this regard.
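The equivalence claimed above is easy to demonstrate with Python 3 text strings (a sketch using the stdlib `urllib.parse.unquote`):

```python
from urllib.parse import unquote

# The same character, percent-encoded under two different charsets:
utf8_form = unquote("/a%c3%bf", encoding="utf-8")        # u"/a\u00FF"
latin1_form = unquote("/a%ff", encoding="iso-8859-1")    # u"/a\u00FF"
assert utf8_form == latin1_form == "/a\u00ff"

# Byte-wise, though, the two request targets are distinct:
assert b"/a\xc3\xbf" != b"/a\xff"
```

A byte-string environ can only ever see the second comparison; the codepoint-level equivalence is invisible to it.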

> For something like Apache, where it could map to multiple WSGI
> applications, then it may want to provide means of overriding
> encoding for specific subsets of URLs, i.e., using the Location
> directive for example.

For the three reasons above, I don't think we can assume that the application 
will always receive equivalent URI's encoded in a single, foreseen encoding. 
Yet we still haven't answered the question of how to handle unforeseen 
encodings. You're right that, if the server-side stack as a whole cannot map a 
particular URI to a handler, it should respond with a 4xx code. I'd prefer 404 
over 400, but either is fine.

However, we quite often use only a portion of the URI when attempting to locate 
an appropriate handler; sometimes just the leading "/" character! The remaining 
characters are often passed as function arguments to the handler, or stuck in 
some parameter list/dict. In many cases, the charset used to decode these 
values either: is unimportant; follows complex rules from one resource to 
another; or is merely reencoded, since the application really does care about 
bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI 
environ entry to declare the charset which was used to decode) can handle all 
of these cases. Server configuration options cannot, at least not without their 
specification becoming unwieldy.
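A minimal sketch of that fallback, assuming the server hands us the raw path as bytes (the function name and signature here are illustrative, not part of any spec; `REQUEST_URI_ENCODING` is the environ key proposed earlier in this thread):

```python
def decode_path(raw_path, environ, default="utf-8", fallback="iso-8859-1"):
    """Decode a raw (byte) request path, recording the charset used."""
    try:
        text = raw_path.decode(default)
        environ["REQUEST_URI_ENCODING"] = default
    except UnicodeDecodeError:
        # Every byte sequence is valid ISO-8859-1, so this cannot fail,
        # and apps that care about bytes can losslessly re-encode.
        text = raw_path.decode(fallback)
        environ["REQUEST_URI_ENCODING"] = fallback
    return text

environ = {}
path = decode_path(b"/caf\xc3\xa9", environ)   # valid UTF-8
assert (path, environ["REQUEST_URI_ENCODING"]) == ("/caf\xe9", "utf-8")

path = decode_path(b"/caf\xe9", environ)       # not valid UTF-8
assert (path, environ["REQUEST_URI_ENCODING"]) == ("/caf\xe9", "iso-8859-1")
```

Because ISO-8859-1 round-trips every byte, nothing is lost: an application that knows its own encoding can always recover the original bytes and re-decode.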


Robert Brewer
fuman...@aminus.org

[1] http://markmail.org/message/r6qzszybsk5pwzbt
[2] http://markmail.org/message/47cekkpvdjaectvi
[3] http://markmail.org/message/3bsxo7q6eztcp3yo
[4] http://www.w3.org/TR/html4/interact/forms.html#idx-character_encoding
[5] http://tools.ietf.org/html/rfc3986#section-6
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig