Hi.
One of my favourite subjects...
1) as per the HTTP specs, the server should send a Content-Type header
along with any response to a browser. If the response is of the general
type "text", then this Content-Type header should also contain a charset
parameter, indicating the character set and the encoding.
If not indicated, this defaults to iso-8859-1 (which is a charset and an
8-bit encoding).
Apache and Tomcat normally do that, but a badly-written application can
override that and screw things up. There are also cases where Apache
and Tomcat genuinely do not know, as when picking up a file from disk,
and have to pick either the default iso-8859-1 or what their
configuration specifies as a default.
Of course this is sometimes wrong.
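
To make (1) concrete, here is a minimal sketch of how a client could pick
the charset out of a Content-Type value and fall back to the default when
none is given. The function name response_charset and the constant
HTTP_TEXT_DEFAULT are made up for this illustration; the parsing leans on
Python's standard email.message module rather than on anything Apache or
Tomcat actually do.

```python
from email.message import Message

# Default described in (1) for "text" responses without a charset.
HTTP_TEXT_DEFAULT = "iso-8859-1"

def response_charset(content_type: str) -> str:
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset() or HTTP_TEXT_DEFAULT

print(response_charset("text/html; charset=UTF-8"))  # utf-8
print(response_charset("text/plain"))                # iso-8859-1
```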
2) also per the HTTP specs, when the server sends a Content-Type header,
the client (browser) should not second-guess the server. It should
accept and respect the header in order to interpret the content.
Major discrepancy: all versions of IE that I know of second-guess the
server, in clear violation of the HTTP specs. They make their own
inspection and heuristic determination of the content received, and
they get it wrong in a number of cases. Unfortunately, since IE still
accounts for over 90% of the browsers used in corporate environments,
the poor webapp programmer is forced to take this bad behaviour into
account.
3) If the server sends back a document prefixed by a BOM, then IE also
automatically interprets the document as being Unicode, no matter what
the server (or the document) says. This is stupid because a UTF-8
encoded document does not need a BOM: UTF-8 is a byte-oriented
encoding anyway, with no possibility of getting a byte order wrong.
Windows Notepad saves all Unicode documents with a BOM, even when saving
them as UTF-8.
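
For anyone who has to serve files that went through Notepad, a small
sketch of what the BOM looks like and how it could be stripped before
sending. The helper name strip_utf8_bom is invented for this example;
codecs.BOM_UTF8 and the "utf-8-sig" codec are part of Python's standard
library.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (bytes EF BB BF) if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# "utf-8-sig" writes the BOM first, like the Notepad behaviour described above.
notepad_style = "André".encode("utf-8-sig")
plain = "André".encode("utf-8")

print(notepad_style[:3] == codecs.BOM_UTF8)    # True
print(strip_utf8_bom(notepad_style) == plain)  # True
```

Decoding with "utf-8-sig" instead of "utf-8" has the same effect on the
reading side: the BOM is dropped if present and ignored if absent.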
4) the HTML specs are distinct from the HTTP specs. In the HTML specs,
there exists a <meta http-equiv="Content-Type" ...> tag, which supposedly
can contain a charset indication about the content of this HTML page.
I personally find this rather clumsy, because the client has to start
reading and decoding the HTML document before it can read and interpret
this tag, so its real practical significance is doubtful. It also
seems to be superfluous and confusing considering (1) and (2) above.
(Like, what if (1) and (4) specify different charsets/encodings?)
But ok, it might be of some use for HTML editors, which could use this
to try to interpret correctly a document loaded from disk, in which case
there is no Content-Type sent by a server.
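
The chicken-and-egg problem in (4) shows up clearly in code. Below is a
rough sketch, not how any real browser does it: the raw bytes are scanned
for a meta charset before anything is decoded, and the HTTP header wins
when both are present (which is, as far as I know, the priority HTML 4.01
gives). The names sniff_meta_charset and effective_charset, the 1024-byte
scan window and the regular expression are all assumptions of this
example.

```python
import re

# Very rough pattern: a <meta ...> tag containing charset=NAME.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)

def sniff_meta_charset(body: bytes):
    """Look for a meta charset in the first bytes, before any decoding."""
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None

def effective_charset(header_charset, body: bytes) -> str:
    """HTTP header first, then the meta tag, then the iso-8859-1 default."""
    return header_charset or sniff_meta_charset(body) or "iso-8859-1"

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8"></head>')

print(effective_charset("iso-8859-1", page))  # iso-8859-1 (header wins)
print(effective_charset(None, page))          # utf-8 (e.g. file from disk)
```

This only works because the tag itself is plain ASCII in most encodings,
which is exactly the clumsiness complained about above.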
5) the HTTP specs as well as the HTML specs are still neither entirely
precise nor unambiguous about some aspects of the general character set
issues. For example, when a POST request contains data encoded as
"URL-encoded", the request normally does not say which charset the
percent-encoded bytes represent. Also, even modern browsers (including
Firefox 3) do not properly specify the encoding of multi-part POSTs.
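
A short illustration of the URL-encoded ambiguity in (5), using Python's
urllib.parse. The field name q and the value are made up; the point is
only that the same form value gives different bytes depending on the
charset the sender used, and the receiver has to guess which one it was.

```python
from urllib.parse import urlencode, parse_qs

form = {"q": "café"}

as_utf8 = urlencode(form, encoding="utf-8")         # q=caf%C3%A9
as_latin1 = urlencode(form, encoding="iso-8859-1")  # q=caf%E9

# Receiver guesses iso-8859-1 for a body that was really UTF-8:
wrong_guess = parse_qs(as_utf8, encoding="iso-8859-1")
# Receiver guesses UTF-8 for a body that was really iso-8859-1;
# parse_qs replaces undecodable bytes with U+FFFD by default.
other_wrong_guess = parse_qs(as_latin1, encoding="utf-8")

print(as_utf8, as_latin1)
print(wrong_guess["q"][0])  # cafÃ©
```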
6) encoding rules are different for the URLs, for the HTTP headers, and
for the content. Even a URL has two distinct types of encoding: the
part for the hostname (Punycode, RFC 3492), and the part for the path
and query-string (charset unspecified, percent-encoding).
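
The two URL encodings in (6) side by side, as a small sketch. The
hostname and path are invented. Python's built-in "idna" codec does the
Punycode conversion for the hostname, and urllib.parse.quote does the
percent-encoding; since the charset of the path is unspecified, the UTF-8
used here is just quote's default, and the iso-8859-1 variant is shown to
make the point that both are equally "valid" on the wire.

```python
from urllib.parse import quote

# Hostname: Punycode (ASCII-compatible "xn--" form).
host = "bücher.example".encode("idna").decode("ascii")

# Path: percent-encoding of bytes, in whatever charset the sender picked.
path_utf8 = quote("/résumé")                           # quote() defaults to UTF-8
path_latin1 = quote("/résumé", encoding="iso-8859-1")  # same path, other charset

print(host)         # xn--bcher-kva.example
print(path_utf8)    # /r%C3%A9sum%C3%A9
print(path_latin1)  # /r%E9sum%E9
```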
7) It never ceases to amaze me how much productive time is lost every
year to character set issues on the web, when Unicode/UTF-8 has been
around for several years as a charset/encoding covering all languages
known to man and beyond. Why hasn't a proposal for HTTP 2.x / HTML 5.x
come about, reconciling those aspects and establishing Unicode/UTF-8 as
the default (or only) encoding, for URLs as well as content?
8) What is also missing, in my view, is a more general proposal
covering the format of text files (and text streams), anywhere. To
remove any ambiguity, each text file/stream should contain at least a
short prefix indicating its MIME type and its charset/encoding.
All the above is why I keep on seeing my name echoed back to me as
André, even on some well-known supposedly international websites.
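
The usual way a name like mine gets mangled can be reproduced in a few
lines: UTF-8 bytes on the wire, decoded by the other side as iso-8859-1.
This is only a sketch of one common failure; the sites in question may
well get it wrong in other ways.

```python
name = "André"

wire_bytes = name.encode("utf-8")          # the "é" becomes two bytes: C3 A9
garbled = wire_bytes.decode("iso-8859-1")  # each byte read as one character

print(garbled)                             # AndrÃ©
print(len(name), len(garbled))             # 5 6

# The damage is reversible as long as nothing else touched the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
```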