Hi.
I have been wondering for a while about how a server application should
really treat the "query string" part of a URL with regard to character
encoding. I am talking here about a URL of the form
http://hostname/somepath?name1=value1&name2=value2..&nameN=valueN
(the part after the question mark)
Starting with a quote from
http://www.w3.org/TR/html401/interact/forms.html#h-17.3 :
accept-charset = charset list [CI]
This attribute specifies the list of character encodings for input
data that is accepted by the server processing this form. The value is a
space- and/or comma-delimited list of charset values. The client must
interpret this list as an exclusive-or list, i.e., the server is able to
accept any single character encoding per entity received.
The default value for this attribute is the reserved string
"UNKNOWN". User agents may interpret this value as the character
encoding that was used to transmit the document containing this FORM
element.
Some people (myself included), after trying to digest the various RFCs
and other recommendations that seem to deal with the subject (e.g.
RFC 3986 and the document above), come to the conclusion that the
character set and/or encoding of the query string, after
percent-decoding, is basically undefined from a server's point of view.
Others seem to be convinced that it is Unicode encoded as UTF-8.
Yet others believe that it is, by default, ISO-8859-1.
So which is it?
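
To make the ambiguity concrete, here is a minimal Python sketch (the sample
value "café" is made up for illustration, it does not come from any
specification). It shows that percent-decoding only gives the server bytes,
and that the same bytes read differently depending on the encoding assumed.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical form value containing one non-ASCII character.
value = "café"

# What a browser would put on the wire, depending on the encoding it picked.
sent_as_utf8 = quote(value, encoding="utf-8")         # 'caf%C3%A9'
sent_as_latin1 = quote(value, encoding="iso-8859-1")  # 'caf%E9'

# Percent-decoding yields bytes, not characters.
raw = unquote_to_bytes(sent_as_utf8)                  # b'caf\xc3\xa9'

# The same bytes under the two common assumptions.
as_utf8 = raw.decode("utf-8")          # 'café'
as_latin1 = raw.decode("iso-8859-1")   # 'cafÃ©'
```

Nothing in the request line itself tells the server which of the two
readings the browser intended.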
If I take the above quotation for instance, the part "User agents *may*
interpret" (the emphasis is mine) bothers me, in the sense that it
implies that the browser can do what it wants anyway.
The other part that bothers me is that according to the above, the
"accept-charset" attribute can specify *a list* of character encodings,
and not just one.
Then the above goes on to say "the server is able to accept any single
character encoding per entity received". What, in this case, is an
"entity"? Are we talking about the whole form submission, as in the
entire query string, or about individual data items, as in the
individual "name=value" pairs?
So basically, what will the browser pick, and how would the server know
what it picked?
One could argue that the server should only send forms as follows:
- the server response to the browser should contain a "Content-Type:"
header that specifies not only the MIME type "text/html" (or
equivalent), but also a "charset" parameter.
- the HTML document being sent should contain a <meta> tag that
explicitly provides the document charset/encoding, like
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />.
- the <form> in the document should specify an "accept-charset"
attribute, preferably with a single charset/encoding like "utf-8".
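
As a rough illustration of the three measures listed above, here is a small
Python sketch. The helper name build_form_response, the "/search" action and
the field name are assumptions made up for this example; it only builds the
headers and body, without tying itself to any particular server framework.

```python
# Encoding declared consistently in the header, the <meta> tag and the form.
CHARSET = "utf-8"

def build_form_response(action="/search"):
    """Hypothetical helper: return (headers, body) for a form page."""
    body = f"""<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset={CHARSET}" />
</head>
<body>
<form action="{action}" method="get" accept-charset="{CHARSET}">
<input type="text" name="name1" />
<input type="submit" />
</form>
</body>
</html>
""".encode(CHARSET)
    headers = {
        # 1) HTTP header with an explicit charset parameter
        "Content-Type": f"text/html; charset={CHARSET}",
        "Content-Length": str(len(body)),
    }
    return headers, body

headers, body = build_form_response()
```

Even with all three in place, this only expresses what the server would
like to receive; it does not prove what a later request actually contains.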
That's all well and good, but
a) if the incoming URL is something typed by a user into the URL bar of
the browser, there is no such previous response sent by the server.
b) HTTP being a stateless protocol, the server should in any case not
have any recollection that it has previously sent such a form to the
same browser (yesterday?), so when a request comes in, the server
doesn't know any of the things above for sure.
c) the browser may decide to do whatever it pleases and disregard what
the server told it (IE comes to mind, practical examples on request).
It would then be in violation of the specifications, but considering
the above I'm not so sure it is clear-cut.
For a while now, I have resorted to doing all of the things above and,
in addition, to always sending forms that specify
enctype="multipart/form-data", for which the problem should not exist.
In addition, I make sure that each form contains a hidden field, itself
containing a string with a content known to the application, which upon
form submission can be checked for any discrepancy (at least between
UTF-8 and an ISO-8859 encoding; it can unfortunately not distinguish
between different iso-8859 encodings).
But that seems like some hideous overkill, and still not totally foolproof.
(multipart/form-data also has the drawback that it does not play
very well with some authentication schemes using redirects)
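
For what it's worth, the hidden-field check described above could look
roughly like the following Python sketch. The field name "_enc", the
sentinel text "é" and the list of candidate encodings are all assumptions
chosen for illustration. As noted, a sentinel like this cannot tell apart
ISO-8859 variants that encode it to the same byte.

```python
from urllib.parse import unquote_to_bytes

SENTINEL_FIELD = "_enc"   # hypothetical hidden field name
SENTINEL_TEXT = "é"       # hypothetical content known to the application
CANDIDATES = ("utf-8", "iso-8859-1")

def raw_pairs(query_string):
    """Split a query string into (name, value) pairs of undecoded bytes."""
    pairs = []
    for item in query_string.split("&"):
        if not item:
            continue
        name, _, value = item.partition("=")
        # Form encoding uses '+' for a space; undo it before percent-decoding.
        pairs.append((unquote_to_bytes(name.replace("+", " ")),
                      unquote_to_bytes(value.replace("+", " "))))
    return pairs

def detect_encoding(query_string):
    """Return the candidate encoding that matches the sentinel, or None."""
    for name, value in raw_pairs(query_string):
        if name == SENTINEL_FIELD.encode("ascii"):
            for enc in CANDIDATES:
                if value == SENTINEL_TEXT.encode(enc):
                    return enc
    return None

def decode_query(query_string):
    """Decode all pairs using the encoding revealed by the sentinel."""
    enc = detect_encoding(query_string)
    if enc is None:
        raise ValueError("sentinel missing or in an unexpected encoding")
    return [(n.decode(enc), v.decode(enc)) for n, v in raw_pairs(query_string)]
```

A request without the sentinel (a URL typed by hand, for instance) still
leaves the application guessing, which is why this is not foolproof.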
It seems to me that the specifications are still not clear and/or not
tight enough.
Am I missing something ?
(And yes, I know about Punycode, but in my understanding that applies to
DNS hostnames, not to query strings.)
---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.