Henri Sivonen on 2006-03-11:

I think it would be beneficial to additionally stipulate that
1. The meta element-based character encoding information declaration is expected to work only if the Basic Latin range of characters maps to the same bytes as in the US-ASCII encoding.
Is this realistic? I'm not really familiar enough with character encodings to say if this is what happens in general.
I suppose it is realistic. See below.

Yes, for most encodings, the US-ASCII range is the same, and if you restrict it a bit further (the "INVARIANT" charset in RFC 1345), it covers most of the ambiguous encodings. The others can be easily detected as they usually have very different bit patterns (EBCDIC) or word lengths (UTF-16, UTF-32).
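
To make this concrete, here is a minimal Python sketch of such a check (the exact range tested and the function name are my own choices, not anything from the thread); EBCDIC and UTF-16 fail it, while the typical ambiguous 8-bit encodings pass:

def is_ascii_compatible(label):
    # True if U+0020..U+007E encode to the same single bytes as in US-ASCII.
    try:
        for cp in range(0x20, 0x7F):
            if chr(cp).encode(label) != bytes([cp]):
                return False
    except (LookupError, UnicodeEncodeError):
        return False
    return True

print(is_ascii_compatible("iso-8859-5"))  # True
print(is_ascii_compatible("cp500"))       # False (EBCDIC)
print(is_ascii_compatible("utf-16"))      # False (two bytes per character, plus a BOM)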

2. If there is no external character encoding information nor a BOM (see below), there MUST NOT be any non-ASCII bytes in the document byte stream before the end of the meta element that declares the character encoding. (In practice this would ban unescaped non-ASCII class names on the html and head elements and non-ASCII comments at the beginning of the document.)
Again, can we realistically require this? I need to do some studies of non-latin pages, I guess.
As UA behavior, no. As a conformance requirement, maybe.

If you require browsers to switch on-the-fly, they can redo the decoding when they find the <meta> anyway, and this is no longer a problem. There are a lot of documents with non-ASCII-language comments and <title> tags that are positioned before the <meta>.
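
For illustration, a rough Python sketch of that switch-on-the-fly behaviour (my own simplification; the prescan regex and the windows-1252 fallback are assumptions): decode with a fallback so parsing can start immediately, and if a <meta>-declared charset turns up later, throw the work away and re-decode from the first byte. Non-ASCII comments and <title> text before the <meta> then cost a restart, but nothing breaks:

import re

META = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)

def decode_with_restart(raw, fallback="windows-1252"):
    # First pass: decode speculatively with the fallback.
    text = raw.decode(fallback, errors="replace")
    # If a <meta> declares a charset anywhere, redo the decode from byte 0.
    m = META.search(raw)
    if m:
        label = m.group(1).decode("ascii")
        try:
            return raw.decode(label, errors="replace")
        except LookupError:
            pass  # unknown label: keep the speculative result
    return text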

Authors should avoid including inline character encoding information. Character encoding information should instead be included at the transport level (e.g. using the HTTP Content-Type header).
I disagree.
With HTML and contemporary UAs, there is no real harm in including the character encoding information both at the HTTP level and in the meta as long as the two are not contradictory. On the contrary, the author-provided internal information is actually useful when end users save pages to disk using UAs that do not reserialize with internal character encoding information.
...and it breaks everything when you have a transcoding proxy, or similar.
Well, not until you save to disk, since HTTP takes precedence. However, authors can escape this by using UTF-8. (Assuming here that tampering with UTF-8 would be harmful, wrong and pointless.)

Interestingly, transcoding proxies tend to be brought up by residents of Western Europe, North America or the Commonwealth. I have never seen a Russian person living in Russia or a Japanese person living in Japan bring up transcoding proxies in any online or offline discussion. That's why I doubt the importance of transcoding proxies.

Transcoding is very popular, especially in Russia. With mod_charset in Apache, it will (AFAICT) use the information in the <meta> of the document to determine the source encoding and then transcode it to an encoding it believes the client can handle (based on browser sniffing). It transcodes on a byte level, so the <meta> remains unchanged, but it is overridden by the HTTP header.
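
Roughly, the behaviour described above could be mimicked like this (a toy Python sketch with names and defaults of my own choosing; the real mod_charset works inside Apache, not like this):

import re

META = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)

def transcode_for_client(raw, client_encoding="koi8-r"):
    # Determine the source encoding from the document's own <meta>.
    m = META.search(raw)
    source = m.group(1).decode("ascii") if m else "utf-8"
    # Transcode the bytes for the client; the <meta> text still *names* the
    # old encoding, but the outgoing HTTP header overrides it.
    body = raw.decode(source, errors="replace").encode(client_encoding, errors="replace")
    headers = {"Content-Type": "text/html; charset=" + client_encoding}
    return headers, body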

The <meta> tag is really information for the server; it is the server that is *supposed* to read it and put the data into the HTTP header. Unfortunately, not many servers support that, leaving us with having to parse it in the browsers instead. Reading the <meta> tag for encoding information is basically at the same level as guessing the encoding by frequency analysis: the server didn't say anything, so perhaps you can get lucky.
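
And for completeness, the "server reads the <meta> and fills in the header" idea is about this much code (a minimal sketch; the prescan size and the utf-8 default are placeholders of mine):

import re

META = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)

def content_type_for(path):
    # Prescan the start of the file for a <meta>-declared charset and
    # reflect it in the Content-Type header the server would send.
    with open(path, "rb") as f:
        head = f.read(1024)
    m = META.search(head)
    charset = m.group(1).decode("ascii") if m else "utf-8"
    return "text/html; charset=" + charset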

Character encoding information shouldn't be duplicated, IMHO, that's just asking for trouble.
I suggest a mismatch be considered an easy parse error and, therefore, reportable.

That will not work in the mod_charset case above.

For HTML, user agents must use the following algorithm in determining the
character encoding of a document:
1. If the transport layer specifies an encoding, use that.
Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; UTF-32 makes no practical sense for interchange on the Web.)
I don't know, should there?
I believe there should.

BOM-sniffing should be done *after* looking at the transport layer's information. It might know something you don't. It's a part of the "guessing-the-content" step.
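
Put together, the ordering being argued for here would look roughly like this (a Python sketch under my own assumptions about the prescan length and the fallback; UTF-32 BOMs are left out on purpose, per the parenthetical above):

import re

BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]
META = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.I)

def sniff_encoding(raw, transport_charset=None, fallback="windows-1252"):
    if transport_charset:                  # 1. the transport layer might know something you don't
        return transport_charset
    for bom, name in BOMS:                 # 2. BOM sniffing
        if raw.startswith(bom):
            return name
    m = META.search(raw[:1024])            # 3. <meta> prescan of the ASCII-compatible prefix
    if m:
        return m.group(1).decode("ascii")
    return fallback                        # 4. last-resort guess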

Requirements I'd like to see:

Documents must specify a character encoding and must use an IANA-registered encoding and must identify it using its preferred MIME name or use a BOM (with UTF-8, UTF-16 or UTF-32). UAs must recognize the preferred MIME name of every encoding they support that has a preferred MIME name. UAs should recognize IANA-registered aliases.

That could be useful; the only problem is that the IANA registry of encoding labels is a bit difficult to read when you are trying to figure out which name to use.
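
For example, once the preferred names and aliases have been pulled out of the registry, resolving an author-supplied label is just a table lookup (the entries below are a hand-picked sample of IANA aliases, not the full list):

PREFERRED_MIME_NAME = {
    # lower-cased alias -> preferred MIME name
    "latin1": "ISO-8859-1",
    "iso_8859-1": "ISO-8859-1",
    "csisolatin1": "ISO-8859-1",
    "ms_kanji": "Shift_JIS",
    "csshiftjis": "Shift_JIS",
}

def preferred_name(label):
    label = label.strip().lower()
    return PREFERRED_MIME_NAME.get(label, label)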

Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE (i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the EBCDIC family of encodings. Documents using the UTF-16 or UTF-32 encodings must have a BOM.

I don't think forbidding BOCU-1 is a good idea. If a proper specification of it is ever written, it could be very useful as a compression format for documents.

Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)

Yes, especially since encoding definitions tend to change over time.
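
For what it's worth, emitting U+FFFD for bogus bytes is exactly what e.g. Python's "replace" error handler does on decode:

bogus = b"caf\xe9 \xff\xfe au lait"   # not valid UTF-8
print(bogus.decode("utf-8", errors="replace"))
# -> caf\ufffd \ufffd\ufffd au lait (each bogus byte becomes one U+FFFD)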

Authors are advised to use the UTF-8 encoding. Authors are advised not to use the UTF-32 encoding or legacy encodings. (Note: I think UTF-32 on the Web is harmful and utterly pointless, but Firefox and Opera support it.)

UTF-32 can be useful as an internal format, but I agree that it's not very useful on the web. Not sure about the "harmful" bit, though.

--
\\//
Peter, software engineer, Opera Software

 The opinions expressed are my own, and not those of my employer.
 Please reply only by follow-ups on the mailing list.
