On 01/10/12 18:07, chris.d...@gmail.com wrote:
>     * Use bytes or str for environ keys?
>     * Use bytes or str for environ values?

str, decoded from the request bytes using ISO-8859-1.

>       * Are all environ values created equal or would, for example,
>         QUERY_STRING's value (prior to any parameter to decoding)
>         be handled differently from HTTP_COOKIE

All environ values are created equal (other than the CGI-mandated odd decoding behaviour of SCRIPT_NAME and PATH_INFO).

>       * If str, I see that ISO-8859-1 is the assumed encoding. How much
>         hurt occurs in the world if I just assume utf-8 when decoding to
>         str[4]?

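The immediate effect is that any application written to PEP 3333 would misinterpret every non-ASCII character in the path: it expects to re-encode the str as ISO-8859-1 to get the original bytes back, and that no longer works if the server actually decoded them as UTF-8.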

The more general hurt to the world would be that we would continue the sad pre-PEP 3333 situation where every web server handles non-ASCII characters differently, and so no WSGI application can reliably use Unicode in path segments.
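
As a rough sketch of the convention (the two helper names are mine, purely for illustration, and not anything from the PEP): the server %-decodes the path and decodes the resulting bytes as ISO-8859-1, and an application that wants UTF-8 undoes that last step.

```python
from urllib.parse import unquote_to_bytes

def server_path_info(raw_path: bytes) -> str:
    """Hypothetical server side: %-decode, then decode the bytes as ISO-8859-1."""
    return unquote_to_bytes(raw_path).decode('iso-8859-1')

def app_path_info(environ: dict) -> str:
    """Hypothetical application side: recover the bytes, then decode as UTF-8."""
    return environ['PATH_INFO'].encode('iso-8859-1').decode('utf-8')

environ = {'PATH_INFO': server_path_info(b'/caf%C3%A9')}
print(repr(environ['PATH_INFO']))  # the native str as the server hands it over
print(app_path_info(environ))      # the text the application actually wants
```

If the server had decoded as UTF-8 instead, the encode('iso-8859-1') step in the application would either fail or produce the wrong bytes.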

There is little impact on anything other than the path, because non-ASCII characters almost never appear in the headers. The query string remains %-encoded, so any non-ASCII characters in it are safe. The other places users can put non-ASCII characters are cookies and HTTP Basic Authentication credentials (the Authorization header), but browser support there is so variable/broken that Python's handling would be the least of your worries.
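
To illustrate the query string point, here is a minimal sketch. The environ dict is hand-built for the example, and UTF-8 is an assumption about what the application expects, not something WSGI dictates.

```python
from urllib.parse import parse_qs

# QUERY_STRING arrives still %-encoded, so it is plain ASCII whichever
# encoding the server used to build its native strs.
environ = {'QUERY_STRING': 'q=caf%C3%A9&lang=fr'}

# The application picks the encoding when it decodes the parameters.
params = parse_qs(environ['QUERY_STRING'], encoding='utf-8')
print(params)
```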

> [4] Which is what it should have been all along?

Not necessarily. Even if you decide that all web apps must use UTF-8 for text encoding, it's valid to have URL-encoded, non-text binary data in a path segment. This would be unrecoverable using straight UTF-8.

(It would be recoverable if surrogateescape were used, but PEP 3333 has to encompass language versions that don't have surrogateescape, and it's also questionable whether it should be possible to smuggle non-UTF-8 data into strings that applications assume are safe.)
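
A small sketch of the difference, using made-up bytes that are not valid UTF-8 to stand in for a %-decoded binary path segment:

```python
raw = b'/blob/\xff\xfe\x00\x80'  # illustrative bytes, not valid UTF-8

# ISO-8859-1 maps every byte to exactly one code point: lossless round trip.
as_latin1 = raw.decode('iso-8859-1')
assert as_latin1.encode('iso-8859-1') == raw

# Strict UTF-8 refuses the data outright...
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ...and errors='replace' accepts it but throws the original bytes away.
replaced = raw.decode('utf-8', errors='replace')
lossy = replaced.encode('utf-8') != raw

# surrogateescape round-trips by carrying the bad bytes as lone surrogates,
# which is exactly the "smuggling" in question.
escaped = raw.decode('utf-8', errors='surrogateescape')
assert escaped.encode('utf-8', errors='surrogateescape') == raw
```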

Plus, header values are less likely to be UTF-8, and HTTP specifies that they're ISO-8859-1 (even if that is not well observed by browsers).

Ideally, the interfaces should all be bytes, because HTTP is defined in terms of bytes. But that plays poorly with Python 3's default Unicode strs (for environ et al). So ISO-8859-1 was chosen as a str interface for which the original bytes can at least be recovered.

>     * Should start_response only accept bytes (and error if not), or
>       should it also accept str and encode appropriately?

status and response_headers are, like the request headers, native str (to be ISO-8859-1 encoded). It's only the HTTP entity body that is always bytestring.
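
A minimal sketch of the types involved. The app and the stand-in start_response below are illustrative only; a real server supplies its own start_response.

```python
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    body = 'caf\u00e9\n'.encode('utf-8')   # entity body: always bytes
    headers = [                            # header names and values: native str
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Content-Length', str(len(body))),
    ]
    start_response('200 OK', headers)      # status line: native str
    return [body]

# Call the app by hand with a stand-in start_response to inspect the types.
captured = {}

def start_response(status, response_headers, exc_info=None):
    captured['status'] = status
    captured['headers'] = response_headers

environ = {}
setup_testing_defaults(environ)
chunks = list(app(environ, start_response))
```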

>     * Should the returned iterable be rejected or encoded if not bytes?

I don't think it's specified by the PEP, but wsgiref looks like it'll chuck TypeError when it tries to write str to the buffer/socket.
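
If you'd rather not depend on what a particular server happens to do, one option is to check in middleware. This is only a sketch: require_bytes is a hypothetical helper of mine, and raising TypeError is my choice, not something mandated by the PEP.

```python
def require_bytes(app):
    """Hypothetical middleware: fail loudly if the app yields anything but bytes."""
    def wrapper(environ, start_response):
        result = app(environ, start_response)
        try:
            for chunk in result:
                if not isinstance(chunk, bytes):
                    raise TypeError('WSGI body chunks must be bytes, got %s'
                                    % type(chunk).__name__)
                yield chunk
        finally:
            # Honour the optional close() method on the app's iterable.
            if hasattr(result, 'close'):
                result.close()
    return wrapper

def bad_app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['not bytes']  # str, so the wrapper should reject it

def noop_start_response(status, response_headers, exc_info=None):
    return None

try:
    list(require_bytes(bad_app)({}, noop_start_response))
    rejected = False
except TypeError:
    rejected = True
```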

cheers,

--
And Clover
mailto:a...@doxdesk.com
http://www.doxdesk.com/
gtalk:chat?jid=bobi...@gmail.com
_______________________________________________
Web-SIG mailing list
Web-SIG@python.org
Web SIG: http://www.python.org/sigs/web-sig
