Gregor Schneider wrote:

And it's getting really nuts, when it comes to UTF-8: Talking about
UTF-8 with or without BOM? Even the specs are not clear about that.

Actually, a UTF-8 stream should /never/ need a BOM, because there is no byte order to mark: UTF-8 is by definition byte-oriented. The only problem is that some programs, MS-Windows Notepad for instance, add a BOM to any text file they save as UTF-8. Is anyone surprised?

Another, related issue is this:
If you edit an HTML page and save it as UTF-8 using, for example, Notepad, it will always prefix the file with such a totally superfluous BOM. If you later serve this page with Apache or Tomcat to an Internet Explorer browser, then no matter which HTTP Content-Type + charset header you send, Internet Explorer will see the BOM and decide that the page is encoded in UTF-8, regardless of what any meta tag in the page says.
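
For files that already carry such a BOM, here is a minimal sketch in Python of detecting and stripping it before the page is served. The helper name strip_utf8_bom and the sample page are made up for illustration; codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A page as an editor like Notepad might save it: BOM, then the UTF-8 text.
page = codecs.BOM_UTF8 + "<html>é</html>".encode("utf-8")

# Option 1: remove the BOM at the byte level and keep working with bytes.
clean = strip_utf8_bom(page)

# Option 2: decode with "utf-8-sig", which drops a leading BOM if there is one.
text = page.decode("utf-8-sig")
```

Decoding with plain "utf-8" instead would keep the BOM as a U+FEFF character at the start of the text, which is usually not what you want.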

In my opinion, the whole character-set is a pain in the ass:
I agree with that.


I personally wish IETF came up with some specs saying something like
"the first n bytes of any stream have to be encoded in ASCII containing
length and encoding-type of the rest of the stream".
I agree with that too, in general terms.
I believe that any file, any stream, should start with such a prefix, indicating at least the file's MIME type, charset and encoding (size may be unknown at that point), with a default of "text/plain", Unicode and UTF-8. I also believe there should be an HTTP 2.0 specification, specifying in clear terms a default Unicode/UTF-8 encoding for URLs, HTML pages, form data submission and so on, and an unambiguous way of deviating from that.
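
As a rough sketch of what such a prefix could look like: nothing like this is standardized, so the one-line ASCII header format and the function names write_stream and read_stream below are purely hypothetical. The header carries the MIME type and charset, and anything it leaves out falls back to "text/plain" and UTF-8.

```python
DEFAULT_MIME = "text/plain"
DEFAULT_CHARSET = "utf-8"


def write_stream(body: str, mime: str = DEFAULT_MIME,
                 charset: str = DEFAULT_CHARSET) -> bytes:
    """Prefix the encoded body with a one-line ASCII header (hypothetical format)."""
    header = f"{mime}; charset={charset}\n".encode("ascii")
    return header + body.encode(charset)


def read_stream(data: bytes) -> tuple[str, str, str]:
    """Parse the ASCII header line, then decode the rest with the declared charset."""
    header, sep, body = data.partition(b"\n")
    if not sep:
        raise ValueError("missing header line")
    mime, charset = DEFAULT_MIME, DEFAULT_CHARSET
    fields = header.decode("ascii").split(";")
    if fields[0].strip():
        mime = fields[0].strip()
    for field in fields[1:]:
        name, _, value = field.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            charset = value.strip()
    return mime, charset, body.decode(charset)


# Round trip with a non-default charset.
stream = write_stream("Grüße", mime="text/html", charset="iso-8859-1")
result = read_stream(stream)

# An empty header line falls back to the defaults.
fallback = read_stream(b"\nhello")
```

Because the header is pure ASCII and ends at the first newline, a reader can parse it before knowing anything about how the body is encoded.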

The problem is in bringing this about.

I put that on my wish list for Xmas.
That's nice, but you would have to start by convincing Santa Claus.


