Re: Tomcat 5 and UTF-8

André Warnier Fri, 03 Apr 2009 15:44:08 -0700

Hi.

One of my preferred subjects...

1) as per the HTTP specs, the server should send a Content-Type header along with any response to a browser. If the response is of the general type "text", then this Content-Type header should also contain a charset attribute, indicating the character set and the encoding.
If not indicated, this defaults to iso-8859-1 (which is a charset and an 8-bit encoding).
Apache and Tomcat normally do that, but a badly-written application can override that and screw things up. There are also cases where Apache and Tomcat genuinely do not know, as when picking up a file from disk, and have to pick either the default iso-8859-1 or what their configuration specifies as a default.

Of course this is sometimes wrong.
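
To make the rule in (1) concrete, here is a minimal sketch in Python of how a client might pick the charset from a Content-Type header value, falling back to iso-8859-1 when none is given. The function name charset_of is my own invention for illustration, and the parsing is deliberately naive (it does not check that the type is "text").

```python
def charset_of(content_type, default="iso-8859-1"):
    """Return the charset named in a Content-Type value, else the default."""
    # Everything after the first ';' is a list of name=value parameters.
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            # Charset names are case-insensitive and may be quoted.
            return value.strip().strip('"').lower()
    return default


print(charset_of("text/html; charset=UTF-8"))
print(charset_of("text/html"))
```

The second call is the case where the server said nothing, so the client can only assume the default.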

2) also per the HTTP specs, when the server sends a Content-Type header, the client (browser) should not second-guess the server. It should accept and respect the header in order to interpret the content.
Major discrepancy: all versions of IE which I know of second-guess the server, in clear violation of the HTTP specs, and make their own inspection and heuristic determination of the content received, and unfortunately they get it wrong in a number of cases. Unfortunately also, since IE still accounts for over 90% of the browsers used in corporate environments, the poor webapp programmer is forced to take this bad behaviour into account.

3) If the server sends back a document prefixed by a BOM, then IE also automatically interprets the document as being Unicode, no matter what the server (or the document) says. This is stupid because a UTF-8 encoded document does not need a BOM, considering it is a byte-oriented encoding anyway, with no possibility of getting a byte-order wrong.
Windows Notepad saves all Unicode documents with a BOM, even when saving them as UTF-8.
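
For those who have never looked at one: the UTF-8 BOM is just the three bytes EF BB BF at the start of the data. A small Python sketch of what such a file looks like and how a program could drop the prefix before serving it (strip_utf8_bom is a hypothetical helper name, not something Tomcat provides):

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Python's "utf-8-sig" codec writes the BOM first, like Notepad does.
with_bom = "André".encode("utf-8-sig")
print(with_bom)
print(strip_utf8_bom(with_bom))
```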

4) the HTML specs are distinct from the HTTP specs. In the HTML specs, there exists a <meta HTTP-equiv="Content-Type" ..> tag, which supposedly can contain a charset indication about the content of this HTML page.
I personally find this rather clumsy, because the client has to start reading and decoding the HTML document before it can read and interpret this header, so its real practical significance is doubtful. It also seems to be superfluous and confusing considering (1) and (2) above. (Like, what if (1) and (4) specify different charsets/encodings?)
But ok, it might be of some use for HTML editors, which could use this to try to interpret correctly a document loaded from disk, in which case there is no Content-Type sent by a server.

5) the HTTP specs as well as the HTML specs are still neither entirely precise nor unambiguous about some aspects of the general character set issues. One example is a POST request whose data is "URL-encoded". Also, even modern browsers (including Firefox 3) do not properly specify the encoding of multi-part POSTs.

6) encoding rules are different for the URLs, for the HTTP headers, and for the content. Even a URL has two distinct types of encoding: the part for the hostname (Punycode, RFC 3492), and the part for the path and query-string (charset unspecified, percent-encoding).
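
To illustrate (6), a short Python sketch showing the two encodings side by side. The hostname and path are made-up examples. Note how the same path percent-encodes differently depending on which charset one assumes, which is exactly the ambiguity.

```python
from urllib.parse import quote

# Hostname part: Python's built-in "idna" codec applies Punycode per label.
host = "münchen.example".encode("idna").decode("ascii")

# Path part: percent-encoding of the bytes, but the bytes depend on the charset.
path_utf8 = quote("/café", encoding="utf-8")
path_latin1 = quote("/café", encoding="iso-8859-1")

print(host)
print(path_utf8)
print(path_latin1)
```

A server that receives the path has no way to tell from the request alone which of the two charsets the client used.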

7) It never ceases to amaze me, the amount of productive time lost every year with character set issues on the web, when Unicode/UTF-8 has been around for several years as a charset/encoding covering all languages known to man and beyond. Why hasn't a proposal for HTTP 2.x / HTML 5.x come about, reconciling those aspects and establishing Unicode/UTF-8 as the default (or only) encoding, for URLs as well as content?

8) What is also missing, in my view, is some more general proposal covering the format of text files (and text streams), anywhere. To remove any ambiguity, each text file/stream should contain at least a short prefix indicating its MIME type and its charset/encoding.

All the above is why I keep on seeing my name echoed back to me as AndrÃ©, even on some well-known supposedly international websites.
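
For the curious, this is how that happens: the name is sent as UTF-8 bytes, and the receiving side decodes them as iso-8859-1. A minimal Python sketch (the function names are mine, for illustration only):

```python
def mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode those bytes as iso-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Undo the mistake, assuming exactly one wrong decoding step happened."""
    return garbled.encode("iso-8859-1").decode("utf-8")


print(mojibake("André"))
print(repair(mojibake("André")))
```

The "é" is the two bytes C3 A9 in UTF-8, and each of those bytes is a character of its own in iso-8859-1.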


