[users@httpd] what is the charset of a URL ?

André Warnier Sat, 07 Feb 2009 13:35:55 -0800

Hi.

I have been wondering for a while about how a server application should really consider the "query string" part of a URL, in terms of character encoding. I am talking here of a URL of the form

http://hostname/somepath?name1=value1&name2=value2..&nameN=valueN
(the part after the question mark)


Starting with a quote from
http://www.w3.org/TR/html401/interact/forms.html#h-17.3 :

accept-charset = charset list [CI]

This attribute specifies the list of character encodings for input data that is accepted by the server processing this form. The value is a space- and/or comma-delimited list of charset values. The client must interpret this list as an exclusive-or list, i.e., the server is able to accept any single character encoding per entity received.

The default value for this attribute is the reserved string "UNKNOWN". User agents may interpret this value as the character encoding that was used to transmit the document containing this FORM element.

Some people (myself included), after trying to digest the various RFCs and other recommendations that seem to deal with the subject (e.g. RFC 3986 and the document above), come to the conclusion that the character set and/or encoding of the query string, after percent-decoding, is basically undefined from a server's point of view.

Others seem to be convinced that it is Unicode encoded as UTF-8.
Yet others that it is, by default, iso-8859-1.
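
To make the ambiguity concrete, here is a minimal Python sketch (the sample string "café" is only an illustration, not something from the specifications). It shows that the same text produces different percent-encoded values depending on the encoding chosen, and that percent-decoding only gives bytes back, which decode "successfully" under more than one charset.

```python
from urllib.parse import quote, unquote_to_bytes

text = "caf\u00e9"  # "café", an arbitrary sample value

# The same character string yields different percent-encoded query values
# depending on the encoding the sender chose.
as_utf8 = quote(text, encoding="utf-8")          # 'caf%C3%A9'
as_latin1 = quote(text, encoding="iso-8859-1")   # 'caf%E9'

# Percent-decoding gives the server bytes, not characters.
raw = unquote_to_bytes(as_utf8)                  # b'caf\xc3\xa9'

# Both interpretations decode without error; only one matches what was typed.
print(raw.decode("utf-8"))        # café
print(raw.decode("iso-8859-1"))   # cafÃ©
```

Nothing in the request line itself tells the server which of the two readings was intended.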

Now what is it ?

If I take the above quotation, for instance, the part "User agents *may* interpret" (the emphasis is mine) bothers me, in the sense that it implies that the browser can do what it wants anyway.

The other part that bothers me is that, according to the above, the "accept-charset" attribute can specify *a list* of character encodings, and not just one.

The text then goes on to say "the server is able to accept any single character encoding per entity received". What is an "entity" in this case ? Are we talking about the whole form submission, as in the "query string", or about individual data items, as in the individual "name=value" pairs ?

So basically, what will the browser pick, and how would the server know what it picked ?


One could argue that the server should only send forms as follows :

- the server response to the browser should contain a "Content-Type:" header that specifies not only the MIME type "text/html" (or equivalent), but also a "charset" parameter.

- the html document being sent should contain a <meta> tag that explicitly provides the document charset/encoding, like

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />.

- the <form> in the document should specify an "accept-charset" attribute, preferably with a single charset/encoding like "utf-8".
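
As a rough sketch of those three points together, the following Python function builds such a response. The function name, the form field "name1" and the action path are hypothetical, chosen only for illustration; any real application would generate this from its own templates.

```python
from html import escape


def build_form_response(action, charset="utf-8"):
    """Return (headers, body) for a form that declares its encoding three times:
    in the HTTP header, in a <meta> tag, and in the form's accept-charset."""
    page = f"""<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset={charset}" />
</head>
<body>
<form method="get" action="{escape(action)}" accept-charset="{charset}">
<input type="text" name="name1" />
<input type="submit" />
</form>
</body>
</html>
"""
    # Encode the body with the same charset that the headers announce.
    body = page.encode(charset)
    headers = {
        "Content-Type": f"text/html; charset={charset}",
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_form_response("/somepath")
print(headers["Content-Type"])  # text/html; charset=utf-8
```

Even with all three declarations in place, the server is only expressing a preference; it still has to trust that the browser honoured it.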


That's all well and good, but

a) if this incoming URL is something typed by a user in the URL bar of the browser, there is no such previous response sent by the server.

b) HTTP being a stateless protocol, the server should anyway not have any recollection that it has previously sent such a form to the same browser (yesterday ?), so when a request comes in, the server doesn't know any of the things above for sure.

c) the browser may decide to do whatever it pleases and disregard what the server told it (IE comes to mind, practical examples on request). It would then be in violation of the specifications, but considering the above I'm not so sure it is clear-cut.

For a while now, I have resorted to doing all the things above and, in addition, to always sending forms specifying "enctype=multipart/form-data", for which the problem should not exist. I also make sure that each form contains a hidden field, itself containing a string with a content known to the application, which upon form submission can be checked for any discrepancy (at least between UTF-8 and an ISO-8859 encoding; unfortunately it cannot distinguish between different ISO-8859 encodings).
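
The hidden-field check could look roughly like the Python sketch below. The field name "enc_check", the sentinel value "é" and the list of candidate encodings are assumptions made for the example, and for brevity it parses a raw query string rather than a multipart/form-data body.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes

SENTINEL = "\u00e9"  # "é": hypothetical known content of the hidden field
CANDIDATES = ("utf-8", "iso-8859-1")


def guess_encoding(raw_query: str, field: str = "enc_check") -> Optional[str]:
    """Return the first candidate encoding under which the hidden field decodes
    to the expected sentinel, or None if the field is missing or matches none."""
    for pair in raw_query.split("&"):
        name, _, value = pair.partition("=")
        if name != field:
            continue
        # Work on the raw bytes, before any charset has been assumed.
        raw = unquote_to_bytes(value.replace("+", " "))
        for encoding in CANDIDATES:
            try:
                if raw.decode(encoding) == SENTINEL:
                    return encoding
            except UnicodeDecodeError:
                pass  # not valid in this encoding, try the next one
        return None
    return None


print(guess_encoding("enc_check=%C3%A9&name1=value1"))  # utf-8
print(guess_encoding("enc_check=%E9&name1=value1"))     # iso-8859-1
```

With this sentinel, the byte 0xE9 means "é" in ISO-8859-1 and in ISO-8859-15 alike, which is exactly why the check cannot tell the ISO-8859 variants apart.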


But that seems like some hideous overkill, and still not totally foolproof.

(multipart/form-data also has the drawback that it does not play very well with some authentication schemes using redirects)

It seems to me that the specifications are still not clear and/or not tight enough.


Am I missing something ?

(And yes, I know about Punycode, but in my understanding that applies to DNS hostnames, not to query strings.)






---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: users-unsubscr...@httpd.apache.org
  "   from the digest: users-digest-unsubscr...@httpd.apache.org
For additional commands, e-mail: users-h...@httpd.apache.org
