Hello,

Ian said:
> Having two ways of expressing the same information will lead to bugs
> related to which data is canonical. If an application is using
> SCRIPT_NAME/PATH_INFO and then updates those values in any way, and
> wsgi.raw_script_name/wsgi.raw_path_info are present, then there will be
> weird bugs and code will disagree about which one is correct. Since %2f
> can exist in the raw versions, there isn't even a way to chunk the two
> variables in the same way.

I can't agree more.

I would propose the following, and excuse me in advance if this has already
been proposed and discarded -- I tried to follow this topic on the mailing
list over the past few months, until it became an endless discussion.

I think only the raw values should be available. Even if a middleware changes
them, it must write them back as raw values. And because you cannot change
those values without knowing which encoding the request uses, the character
encoding *must* be present.

I know that sounds easy, but it's not: browsers don't specify the charset in
the Content-Type of a request; instead, they generate the new request using
the charset from the previous response. So the charset is unknown to the
server/gateway and to the middleware stack.

So, what we could do is introduce a mandatory variable called, say,
wsgi.charset, which would be used as follows:
- It MUST be set by the server or gateway on every request.
- Every middleware or application that reads or writes the raw values MUST
  use the charset specified in wsgi.charset.
- If a server, gateway, middleware or application wants to change the
  charset and it is possible*, it MUST convert the *entire* request into
  that charset and update wsgi.charset accordingly.
- When the charset is not specified in the HTTP request, the server/gateway
  MUST assume UTF-8, unless the user has specified another default charset.

I think/hope that will solve all the problems.

What happens when a WSGI application is actually made up of two WSGI
applications and they send their responses in different charsets? If it's not
possible to configure them so that they both use the same charset, then one
of them would have to be wrapped by a middleware which:
- On egress, converts the responses to the charset used by the other
  application.
- On ingress, if the charset is not specified in the request, assumes it's
  the one used by the other application, and thus converts the request to
  the charset supported by the wrapped application.

It would look like this (CharsetNormalizer and myapp are placeholders):
===
import trac.web.main

def application(environ, start_response):
    # Dispatch on the request path.
    if environ.get("PATH_INFO", "").startswith("/trac/"):
        # Say Trac only supports Latin-1 and we want responses to use UTF-8:
        app = trac.web.main.dispatch_request
        app = CharsetNormalizer(app, response="latin-1", request="utf8")
    else:
        # myapp uses UTF-8
        app = myapp
    return app(environ, start_response)
===

Then there's the string vs. bytes issue. Bytes would be the natural choice to
represent these raw values, but they would probably cause more problems than
they solve. So, I think they should be strings that contain the raw,
ASCII-encoded values (i.e., str on both versions of Python).

What do you think about this? Again, sorry if this has been discarded
before! :)

* For example, you can always convert Latin-1 to UTF-8, but not every UTF-8
string can be converted to Latin-1.

-- 
Gustavo Narea <xri://=Gustavo>.
| Tech blog: =Gustavo/(+blog)/tech ~ About me: =Gustavo/about |

_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
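
P.S. To make the ingress step described above a bit more concrete, here is a rough sketch of how a middleware might re-encode the raw path values from one charset to another. It is only an illustration under several assumptions: wsgi.charset is the variable proposed in this message (it is not part of any WSGI spec), wsgi.raw_script_name and wsgi.raw_path_info are the keys from Ian's quote, and the names recode_raw_path and IngressCharsetNormalizer, along with their parameters, are made up for the example. It is written for Python 3 only and leaves out the egress half and the request body.

```python
from urllib.parse import quote, unquote_to_bytes


def recode_raw_path(raw_path, source, target):
    """Re-encode a percent-encoded path from `source` to `target` charset.

    Hypothetical helper. Raises UnicodeEncodeError when a character cannot
    be represented in `target` (e.g. UTF-8 text that has no Latin-1 form).
    """
    segments = []
    # Split on literal slashes first, so that an encoded slash (%2F) inside
    # a segment stays encoded and is never mistaken for a path separator.
    for segment in raw_path.split("/"):
        text = unquote_to_bytes(segment).decode(source)
        segments.append(quote(text.encode(target), safe=""))
    return "/".join(segments)


class IngressCharsetNormalizer:
    """Hypothetical middleware covering only the ingress conversion."""

    RAW_KEYS = ("wsgi.raw_script_name", "wsgi.raw_path_info")

    def __init__(self, app, wrapped_charset, default_charset="utf-8"):
        self.app = app
        self.wrapped_charset = wrapped_charset
        self.default_charset = default_charset

    def __call__(self, environ, start_response):
        # Fall back to the default when no charset has been announced.
        charset = environ.get("wsgi.charset", self.default_charset)
        for key in self.RAW_KEYS:
            if key in environ:
                environ[key] = recode_raw_path(
                    environ[key], charset, self.wrapped_charset
                )
        # The raw values now use the wrapped application's charset.
        environ["wsgi.charset"] = self.wrapped_charset
        return self.app(environ, start_response)
```

Splitting on "/" before decoding is what keeps %2F intact, which is the chunking problem Ian mentions for the raw values.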