Oleg Kalnichevski wrote:

This is one of many 'shady' areas of the HTTP spec. Basically there is
no standard way for the client to communicate to the server what coding
has been used to encode query parameters.

It's definitely shady. I've seen two approaches used here. In the past, many internationalized applications would assume that the URL-encoded non-ASCII characters in submitted URIs were in the same character set as the page that submitted the request. So if you know that you generated foo.jsp in Latin-5, you assume that any URI requests coming from foo.jsp should be treated as Latin-5 after being URL-decoded. There's a paper on this technique floating around somewhere, written by a guy I used to work with at IBM, but I can't find it on the Web.
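
A minimal sketch of this first approach in Java, assuming the application already knows which charset the submitting page was generated in. The class name PageCharsetDecoder and the pageCharset parameter are made up for illustration; only java.net.URLDecoder is a real API here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class PageCharsetDecoder {

    // Decodes one percent-encoded query value using the charset the
    // submitting page was generated in. pageCharset is supplied by the
    // application (an assumption); the request itself does not carry it.
    public static String decode(String rawValue, String pageCharset) {
        try {
            // URLDecoder also turns '+' into a space, as form encoding expects.
            return URLDecoder.decode(rawValue, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + pageCharset, e);
        }
    }

    public static void main(String[] args) {
        // The same escaped byte, 0xFD, is a different letter in each charset.
        System.out.println(decode("%FD", "ISO-8859-9")); // Latin-5
        System.out.println(decode("%FD", "ISO-8859-1")); // Latin-1
    }
}
```

Under ISO-8859-9 (Latin-5) the byte 0xFD is the dotless i (U+0131), while under ISO-8859-1 it is y with acute (U+00FD), which is why guessing the wrong page charset silently corrupts the parameter.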

The more modern approach is to assume that the URI is always in UTF-8. If any non-ASCII bytes remain after URL-decoding, you run them through a UTF-8 converter (UTF-8 to UTF-16 in the case of Java). Here's a proposal on this: http://www.w3.org/International/O-URL-and-ident.html. If you follow the links from there, you'll find other useful pages such as http://www.w3.org/International/questions/qa-forms-utf-8.html.
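
The two steps of the UTF-8 approach (undo the %XX escaping, then convert the resulting bytes from UTF-8 into a Java string) could be sketched roughly as follows. The class and method names are hypothetical, and rejecting malformed UTF-8 outright is a choice made for this sketch, not something the proposal requires.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8QueryDecoder {

    // Step 1: turn %XX escapes (and '+') back into raw bytes.
    static byte[] percentDecode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2;
            } else if (c == '+') {
                out.write(' '); // form-style encoding of a space
            } else if (c > 0x7F) {
                throw new IllegalArgumentException("Unescaped non-ASCII at index " + i);
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Step 2: interpret the bytes as UTF-8, producing a (UTF-16) Java String.
    // Malformed input is reported instead of being replaced with U+FFFD.
    public static String decode(String encoded) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(percentDecode(encoded)))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Not valid UTF-8: " + encoded, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("caf%C3%A9"));
    }
}
```

Reporting malformed input gives the server a cheap signal that a client did not actually send UTF-8: a Latin-1 style value such as caf%E9 fails here rather than decoding to garbage, and the application can then decide whether to fall back to the page-charset assumption.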

-- Laura

