Re: [Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-20 Thread Henry Precheur
On Sun, Aug 16, 2009 at 08:06:03PM -0700, Robert Brewer wrote:
 However, we quite often use only a portion of the URI when attempting
 to locate an appropriate handler; sometimes just the leading /
 character! The remaining characters are often passed as function
 arguments to the handler, or stuck in some parameter list/dict. In
 many cases, the charset used to decode these values either: is
 unimportant; follows complex rules from one resource to another; or is
 merely reencoded, since the application really does care about bytes
 and not characters. Falling back to ISO-8859-1 (and minting a new WSGI
 environ entry to declare the charset which was used to decode) can
 handle all of these cases. Server configuration options cannot, at
 least not without their specification becoming unwieldy.

(Just to make things clear, I am not just talking about REQUEST_URI
here, but about all request headers.)


Decoding everything using ISO-8859-1 has the nice property of keeping
the information intact. It would be a good heuristic if everything, with
a few exceptions, were encoded using ISO-8859-1: just transcode the few
problematic cases at the application level and everybody is happy. A
string decoded from ISO-8859-1 is like a bytes object with a string
'interface' on top of it.
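
For what it's worth, the round-trip property is easy to check in plain
Python (the byte values below are just made up for illustration): every
byte decodes under ISO-8859-1, and re-encoding gives the original bytes
back, so an application that knows the real charset can still recover
the text.

  >>> raw = b'/fran\xc3\xa7ois?q=caf\xe9'   # UTF-8 path, Latin-1 query
  >>> s = raw.decode('iso-8859-1')           # never fails, one char per byte
  >>> s.encode('iso-8859-1') == raw          # lossless round-trip
  True
  >>> s.split('?')[0].encode('iso-8859-1').decode('utf-8')
  '/français'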


But it sweeps the encoding problem under the carpet. The problem with
Python 2 was that str and unicode were almost the same, so much so that
it was possible to mix them without too many problems:

  >>> 'foo' == u'foo'
  True

Python 3 made bytes and string 'incompatible' to force programmers to
handle the encoding problem as soon as possible:

  >>> b'foo' == 'foo'
  False

By passing `str` objects to the application, the application author
could believe that the encoding problem has been handled. But in most
cases it hasn't been handled at all: the application author still has to
transcode every string that was decoded with the wrong charset. We are
back to Python 2's bad old days, where we can't be sure that what we got
is properly encoded:

  Was that string encoded using latin-1? Maybe a middleware transcoded
  it to UTF-8 before the application was called. Maybe the application
  itself transcoded it at some point, but then we need to keep track of
  what was transcoded. Maybe the application should transcode everything
  when it is called.
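
Concretely, the transcoding step the application author is left with
looks something like this (the helper name is mine; REQUEST_URI_ENCODING
is the environ key proposed elsewhere in this thread):

  def real_path(environ):
      # Undo the gateway's Latin-1 decoding and re-decode with the charset
      # the application actually expects (UTF-8 here, by assumption).
      path = environ['PATH_INFO']
      if environ.get('REQUEST_URI_ENCODING') == 'iso-8859-1':
          path = path.encode('iso-8859-1').decode('utf-8')
      return path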

Also, EVERY application author will have to read the PEP, especially
paragraph saying:

   Everything we give you is a string, but you still have to deal
   with the encoding mess.

Otherwise they will run into the same weird problems they had with
Python 2, because the interface is not clear. Strings are supposed to be
text, and only text. Decoding everything as ISO-8859-1 means strings are
not text anymore; they are 'encoded data' [1].


Bytes are supposed to be 'encoded data' and binary blobs. By giving
applications bytes, the author knows right away that they have to be
decoded. No need to read the PEP.
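
For instance (plain Python, the path below is just made up for
illustration), the decode is explicit and the charset is the
application's own, visible choice:

  >>> raw_path = b'/caf\xc3\xa9'        # bytes straight from the request line
  >>> raw_path.decode('utf-8')
  '/café'
  >>> raw_path.decode('iso-8859-1')     # a different, equally explicit choice
  '/cafÃ©'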


`bytes` can do everything `str` can do, with the notable exception of
'format':

  >>> b'foo bar'.title()
  b'Foo Bar'

  >>> b'/foo/bar/fran\xc3ois'.split(b'/')
  [b'', b'foo', b'bar', b'fran\xc3ois']

  >>> re.match(br'/bar/(\w+)/(\d+)', b'/bar/foo/1234').groups()
  (b'foo', b'1234')

I understand that `bytes()` is an unfamiliar beast. But I believe the
encoding problem is the realm of the application, not the realm of the
gateway. Let the application handle the encoding problem and don't give
it a half-baked solution.


Using bytes also has its own set of problems. The standard library
doesn't support bytes very well: for example, urllib.parse.unquote()
doesn't work with bytes, and other parts of urllib.parse have similar
issues.
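
For the record, here is the kind of workaround this forces today (a
rough, illustrative bytes-only percent-decoder written for this message,
not a stdlib function):

  def unquote_bytes(data):
      # Replace %XX escapes without ever leaving bytes.
      parts = data.split(b'%')
      result = [parts[0]]
      for part in parts[1:]:
          if len(part) >= 2:
              try:
                  result.append(bytes([int(part[:2], 16)]) + part[2:])
                  continue
              except ValueError:
                  pass
          result.append(b'%' + part)    # not a valid %XX escape, keep it as-is
      return b''.join(result)

  >>> unquote_bytes(b'/a%c3%bf')
  b'/a\xc3\xbf'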

[1] http://docs.python.org/3.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

-- 
  Henry Précheur


Re: [Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-17 Thread Robert Brewer
I wrote:
 Applications do produce URI's (and IRI's, etc. that need to be
 converted into URI's) and do transfer them in media types like
 HTML, which define how to encode a.href's and form.action's
 before %-encoding them [4]. But these are not the only vectors
 by which clients obtain or generate Request-URI's.
 ...
 As someone (Alan Kennedy?) noted at PyCon, static resources may
 depend upon a filename encoding defined by the OS which is
 different than that of the rest of the URI's generated/understood
 by even the most coherent application.
 ...
 In practical terms, character-by-character comparisons should be
 done codepoint-by-codepoint after conversion to a common character
 encoding. In other words, the URI spec seems to imply that the
 two URI's "/a%c3%bf" and "/a%ff" may be equivalent, if the former
 is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF"
 encoded in ISO-8859-1. Note that WSGI 1.0 cannot speak about
 this, since all environ values must be byte strings. IMO WSGI
 2 should do better in this regard.
 ...
 For the three reasons above, I don't think we can assume that the
 application will always receive equivalent URI's encoded in a
 single, foreseen encoding.

Did I say 3 reasons? I meant 4: Accept-Charset.


Robert Brewer
fuman...@aminus.org


Re: [Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-17 Thread P.J. Eby

At 07:37 AM 8/17/2009 -0700, Robert Brewer wrote:

Did I say 3 reasons? I meant 4: Accept-Charset.


Chief amongst the reasons...  amongst our reasonry...  Right, we'll 
come in again.  ;-)




[Web-SIG] WSGI 2: Decoding the Request-URI

2009-08-16 Thread Robert Brewer
I wrote:
 PATH_INFO and QUERY_STRING are ... decoded via a configurable
 charset, defaulting to UTF-8. If the path cannot be decoded
 with that charset, ISO-8859-1 is tried. Whichever is successful
 is stored at environ['REQUEST_URI_ENCODING'] so middleware and
 apps can transcode if needed.

and Ian replied:
 My understanding is that PATH_INFO *should* be UTF-8 regardless of
 what encoding a page might be in.  At least that's what I got when
 testing Firefox.  It might not be valid UTF-8 if it was manually
 constructed, but then there's little reason to think it is valid...

Actually, current browsers tend to use UTF-8 for the path, and either the 
encoding of the document [1] or Windows-1252 [2] for the querystring. But the 
vast majority of HTTP user agents are not browsers [3]. Even if that were not 
so, we should not define WSGI to only interoperate with the most current 
browsers.
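
For illustration only, here is how the same value percent-encodes under
the two charsets mentioned above (urllib.parse.quote is used purely to
show the difference):

  >>> from urllib.parse import quote
  >>> quote('café', encoding='utf-8')
  'caf%C3%A9'
  >>> quote('café', encoding='cp1252')
  'caf%E9'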

and Graham added:
 Thinking about it for a while, I get the feel that having a fallback
 to latin-1 if string cannot be decoded as UTF-8 is a bad idea. That
 URLs wouldn't consistently use the same encoding all the time just
 seems wrong. I would see it as returning a bad request status. If an
 application coder knows they are actually going to be dealing with
 latin-1, as that is how the application is written, then they should
 be specifying it should be latin-1 always instead of utf-8. Thus, the
 WSGI adapter should provide a means to override what encoding is used.

Applications do produce URI's (and IRI's, etc. that need to be converted into 
URI's) and do transfer them in media types like HTML, which define how to 
encode a.href's and form.action's before %-encoding them [4]. But these are not 
the only vectors by which clients obtain or generate Request-URI's.
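
As a concrete example of that recipe (an IRI path encoded to UTF-8 and
then %-encoded; the snippet is mine, not taken from the HTML spec):

  >>> from urllib.parse import quote
  >>> quote('/a\u00FF', encoding='utf-8')
  '/a%C3%BF'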

 For simple WSGI adapters which only service one WSGI application, then
 it would apply to the whole URL namespace.

As someone (Alan Kennedy?) noted at PyCon, static resources may depend upon a 
filename encoding defined by the OS which is different than that of the rest of 
the URI's generated/understood by even the most coherent application.

The encoding used for a URI is only really important for one reason: URI 
comparison. Comparison is at the heart of handler dispatch, static resource 
identification, and proper HTTP cache operation. It is for these reasons that 
RFC 3986 has an extensive section on the matter [5], including a ladder of 
approaches:

 * Simple String Comparison
 * Case Normalization (e.g. /a%3D == /a%3d)
 * Percent-Encoding Normalization (e.g. /a%62c == /abc)
 * Path Segment Normalization (e.g. /abc/../def == /def)
 * Scheme-Based Normalization (e.g. http://example.com == http://example.com:80/)
 * Protocol-Based Normalization (e.g. /data == /data/ if previous dereferencing showed it to be)

I think it would be beneficial to those who develop WSGI application interfaces 
to be able to assume that at least case-, percent-, path-, and 
scheme-normalization are consistently performed on SCRIPT_NAME and PATH_INFO by 
all WSGI 2 origin servers.
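
A rough sketch of what the case- and percent-encoding rungs of that
ladder might look like on a raw path (my own illustration, not proposed
spec text; posixpath.normpath only approximates RFC 3986 dot-segment
removal):

  import re
  import posixpath

  UNRESERVED = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
                b'abcdefghijklmnopqrstuvwxyz0123456789-._~')

  def normalize_path(raw):
      def fix(match):
          octet = int(match.group(1), 16)
          if octet in UNRESERVED:
              return bytes([octet])              # decode escapes of unreserved chars
          return b'%' + match.group(1).upper()   # uppercase the remaining escapes
      raw = re.sub(b'%([0-9A-Fa-f]{2})', fix, raw)
      return posixpath.normpath(raw)             # remove dot-segments

  >>> normalize_path(b'/a%3d/%62c/../xyz')
  b'/a%3D/xyz'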

All of those except for the first one can be accomplished without decoding the 
target URI. But that first section specifically states: "In practical terms, 
character-by-character comparisons should be done codepoint-by-codepoint after 
conversion to a common character encoding." In other words, the URI spec seems 
to imply that the two URI's "/a%c3%bf" and "/a%ff" may be equivalent, if the 
former is u"/a\u00FF" encoded in UTF-8 and the latter is u"/a\u00FF" encoded in 
ISO-8859-1. Note that WSGI 1.0 cannot speak about this, since all environ 
values must be byte strings. IMO WSGI 2 should do better in this regard.
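
A quick interactive check of that claim (illustration only; unquote here
is given str, not bytes):

  >>> from urllib.parse import unquote
  >>> unquote('/a%c3%bf', encoding='utf-8')
  '/aÿ'
  >>> unquote('/a%ff', encoding='iso-8859-1')
  '/aÿ'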

 For something like Apache, which could map to multiple WSGI
 applications, it may want to provide a means of overriding the encoding
 for specific subsets of URLs, i.e., using the Location directive for
 example.

For the three reasons above, I don't think we can assume that the application 
will always receive equivalent URI's encoded in a single, foreseen encoding. 
Yet we still haven't answered the question of how to handle unforeseen 
encodings. You're right that, if the server-side stack as a whole cannot map a 
particular URI to a handler, it should respond with a 4xx code. I'd prefer 404 
over 400, but either is fine.

However, we quite often use only a portion of the URI when attempting to locate 
an appropriate handler; sometimes just the leading / character! The remaining 
characters are often passed as function arguments to the handler, or stuck in 
some parameter list/dict. In many cases, the charset used to decode these 
values either: is unimportant; follows complex rules from one resource to 
another; or is merely reencoded, since the application really does care about 
bytes and not characters. Falling back to ISO-8859-1 (and minting a new WSGI 
environ entry to declare the charset which was used to decode) can handle all 
of these cases. Server configuration options cannot, at least not without their 
specification becoming unwieldy.
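
A rough sketch of that fallback (the helper name is mine; this is only an
illustration of the idea, not proposed spec text):

  def decode_request_uri(raw_path, charset='utf-8'):
      # Try the configured charset first, fall back to ISO-8859-1, and
      # report which one succeeded so middleware/apps can transcode later.
      try:
          return raw_path.decode(charset), charset
      except UnicodeDecodeError:
          return raw_path.decode('iso-8859-1'), 'iso-8859-1'

  path, used = decode_request_uri(b'/a\xff')      # not valid UTF-8
  environ = {'PATH_INFO': path, 'REQUEST_URI_ENCODING': used}
  # environ['REQUEST_URI_ENCODING'] is now 'iso-8859-1'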