Hi.

One of my preferred subjects...

1) As per the HTTP specs, the server should send a Content-Type header along with any response to a browser. If the response is of the general type "text", then this Content-Type header should also contain a charset parameter, indicating the character set and the encoding. If not indicated, this defaults to iso-8859-1 (which is both a charset and an 8-bit encoding). Apache and Tomcat normally send it, but a badly-written application can override it and screw things up. There are also cases where Apache and Tomcat genuinely do not know the encoding, as when picking up a file from disk, and have to pick either the default iso-8859-1 or whatever their configuration specifies as a default.
Of course this is sometimes wrong.
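
As a rough illustration of (1), here is a minimal Python sketch of how a client could read the charset out of a Content-Type value and fall back to iso-8859-1 when none is given. The function name charset_of is made up for this example; it leans on the standard library's email.message.Message to parse the header parameters, and is not how any particular browser does it.

```python
from email.message import Message

def charset_of(content_type, default="iso-8859-1"):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns the
    # fallback when the header carries no charset parameter at all.
    return msg.get_content_charset(default)

print(charset_of("text/html; charset=UTF-8"))  # utf-8
print(charset_of("text/html"))                 # iso-8859-1
```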

2) Also per the HTTP specs, when the server sends a Content-Type header, the client (browser) should not second-guess the server. It should accept and respect the header in order to interpret the content. Major discrepancy: all versions of IE which I know of second-guess the server, in clear violation of the HTTP specs. They make their own inspection and heuristic determination of the content received, and they get it wrong in a number of cases. Unfortunately, since IE still accounts for over 90% of the browsers used in corporate environments, the poor webapp programmer is forced to take this bad behaviour into account.

3) If the server sends back a document prefixed by a BOM (byte order mark), then IE also automatically interprets the document as being Unicode, no matter what the server (or the document) says. This is stupid, because a UTF-8 encoded document does not need a BOM: it is a byte-oriented encoding anyway, with no possibility of getting the byte order wrong. Windows Notepad saves all Unicode documents with a BOM, even when saving them as UTF-8.
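
To make (3) concrete, here is a small Python sketch showing what the UTF-8 BOM looks like on the wire. It assumes nothing beyond the standard library; "utf-8-sig" is simply Python's name for "UTF-8 with a leading BOM".

```python
import codecs

text = "André"
plain = text.encode("utf-8")          # UTF-8 without a BOM
with_bom = text.encode("utf-8-sig")   # UTF-8 with a leading BOM

# The BOM is just three extra bytes (EF BB BF) in front of identical data.
assert with_bom == codecs.BOM_UTF8 + plain

# A BOM-aware decoder drops the prefix; a plain UTF-8 decoder keeps it
# as an invisible U+FEFF character at the start of the text.
clean = with_bom.decode("utf-8-sig")
leaky = with_bom.decode("utf-8")
print(repr(clean), repr(leaky))
```

The second decode is the one that tends to leave a stray invisible character at the top of a page or in the first field of a file.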

4) The HTML specs are distinct from the HTTP specs. In the HTML specs, there exists a <meta http-equiv="Content-Type" ..> tag, which supposedly can contain a charset indication about the content of this HTML page. I personally find this rather clumsy, because the client has to start reading and decoding the HTML document before it can read and interpret this tag, so its real practical significance is doubtful. It also seems superfluous and confusing considering (1) and (2) above (like, what if (1) and (4) specify different charsets/encodings?). But ok, it might be of some use for HTML editors, which could use it to try to interpret correctly a document loaded from disk, in which case there is no Content-Type sent by a server.
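
The chicken-and-egg problem in (4) can be sketched in a few lines of Python: to find the charset, the reader has to scan the raw bytes before knowing how to decode them, which only works because the tag itself is plain ASCII. The function sniff_meta_charset, the regular expression and the 1024-byte limit are all assumptions made for this example, and much cruder than what real browsers do.

```python
import re

# Look for charset=... inside a <meta ...> tag, working on raw bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def sniff_meta_charset(raw, limit=1024):
    """Return the charset declared in a meta tag near the top, or None."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii").lower() if match else None

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8"></head><body>...</body></html>')
print(sniff_meta_charset(page))  # utf-8
```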

5) The HTTP specs as well as the HTML specs are still not entirely precise or unambiguous about some aspects of the general character set issues. For example, when a POST request contains data in "URL-encoded" form, nothing in the request reliably says which charset the percent-encoded bytes are in. Also, even modern browsers (including Firefox 3) do not properly specify the encoding of multi-part POSTs.
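
A short Python sketch of the ambiguity in (5): the same URL-encoded body gives different text depending on which charset the server assumes. The sample body is invented for illustration; parse_qs is the standard library function from urllib.parse.

```python
from urllib.parse import parse_qs

# What a browser might send for a form submitted from an iso-8859-1 page.
# Nothing in the body itself says which charset %E9 belongs to.
body = "name=Andr%E9"

as_latin1 = parse_qs(body, encoding="iso-8859-1")["name"][0]
as_utf8 = parse_qs(body, encoding="utf-8")["name"][0]

print(as_latin1)       # André
print(repr(as_utf8))   # 'Andr\ufffd' (the lone byte E9 is not valid UTF-8)
```

The server has to guess, and a wrong guess either mangles the character or replaces it outright.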

6) Encoding rules are different for the URLs, for the HTTP headers, and for the content. Even a URL has two distinct types of encoding: the part for the hostname (Punycode, RFC 3492), and the part for the path and query-string (charset unspecified, percent-encoding).
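
The two URL encodings in (6) can be seen side by side in this Python sketch. The hostname and path are made-up examples; the "idna" codec and urllib.parse.quote are from the standard library.

```python
from urllib.parse import quote

host = "bücher.example"
path = "/café"

# Hostname: each non-ASCII label is turned into Punycode with an xn-- prefix.
print(host.encode("idna"))                 # b'xn--bcher-kva.example'

# Path: bytes are percent-encoded, but the result depends on the charset
# chosen first, and the URL itself does not record which one was used.
print(quote(path, encoding="utf-8"))       # /caf%C3%A9
print(quote(path, encoding="iso-8859-1"))  # /caf%E9
```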

7) It never ceases to amaze me, the amount of productive time lost every year with character set issues on the web, when Unicode/UTF-8 has been around for several years as a charset/encoding covering all languages known to man and beyond. Why hasn't a proposal for HTTP 2.x / HTML 5.x come about, reconciling those aspects and establishing Unicode/UTF-8 as the default (or only) encoding, for URLs as well as content ?

8) What is also missing, in my view, is some more general proposal covering the format of text files (and text streams), anywhere. To remove any ambiguity, each text file/stream should contain at least a short prefix indicating its MIME type and its charset/encoding.

All the above is why I keep on seeing my name echoed back to me as André, even on some well-known supposedly international websites.
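
For the curious, the garbled name can be reproduced in two lines of Python. This is only a sketch of the most likely mechanism (UTF-8 bytes read back as iso-8859-1), not a diagnosis of any particular website.

```python
name = "André"

# Encode as UTF-8 (é becomes the two bytes C3 A9), then decode those
# bytes as if they were iso-8859-1, one character per byte.
garbled = name.encode("utf-8").decode("iso-8859-1")
print(garbled)  # André

# Nothing is lost as long as it is caught early: reversing the
# mistake restores the original text.
restored = garbled.encode("iso-8859-1").decode("utf-8")
print(restored)  # André
```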

