On Wed, 05 Aug 2009 02:01:59 +0200, Ian Hickson <i...@hixie.ch> wrote:
> I'm pretty sure that character encoding support in browsers is more of a
> "collect them all" kind of thing than really based on content that
> requires it, to be honest.

Really? I think a lot of them are actually used. If you know anything that would help, I'd love to trim the set of encodings the Web needs down to something smaller than what we currently ship with. Ideally that becomes one fixed list shared across all Web languages.


> If someone can provide a firm list of encodings that they are confident
> are required for a certain substantial percentage of the Web, I'm happy
> to add the list to the spec.

Could you run a survey over your large dataset to find this out? I also read somewhere that Adam Barth was able to instrument Google Chrome to derive a better Content-Type sniffing algorithm. Maybe something similar could be done here? A rough sketch of the kind of tally I have in mind follows below.
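
To make that concrete, here is a minimal sketch in Python of such a survey, assuming a local sample of crawled responses as (headers, body) pairs; the corpus shape and names are my assumptions, not anything Google actually runs:

import collections
import re

# Matches charset labels in Content-Type headers and (crudely) in <meta> tags.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9._-]+)', re.IGNORECASE)

def declared_encodings(responses):
    """Tally declared encoding labels across a sample of crawled pages."""
    counts = collections.Counter()
    for headers, body in responses:
        m = CHARSET_RE.search(headers.get("content-type", "").encode("ascii", "replace"))
        if not m:
            m = CHARSET_RE.search(body[:1024])  # cheap scan for an early <meta>
        label = m.group(1).decode("ascii").lower() if m else "(undeclared)"
        counts[label] += 1
    return counts

sample = [({"content-type": "text/html; charset=EUC-JP"}, b""),
          ({"content-type": "text/html"}, b'<meta charset="windows-1252">')]
print(declared_encodings(sample).most_common())

Sorting the result by frequency would give exactly the kind of firm, percentage-backed list you asked for.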


By the way, we've run into problems with the Unicode encoding label matching algorithm, particularly on some Asian sites. I think HTML5 needs to switch back to something closer to what WebKit/Gecko/Trident do. I realize this means more magic lists, but the current algorithm does not seem to cut it. E.g. some sites rely on the fact that "EUC_JP" is not a recognized label while "EUC-JP" is.
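
To illustrate why the two approaches diverge, here is a small sketch; the label tables are illustrative, not the actual tables any browser ships. Unicode-style (UTS #22) loose matching drops case and punctuation before comparing, so "EUC_JP" matches "EUC-JP", whereas exact label matching leaves "EUC_JP" unrecognized and lets the page fall through to the default or sniffed encoding:

import re

def uts22_key(label):
    """UTS #22-style loose matching: ignore case and non-alphanumerics.
    (The full algorithm also ignores leading zeros in digit runs.)"""
    return re.sub(r'[^a-z0-9]', '', label.lower())

KNOWN = {"euc-jp": "EUC-JP", "shift_jis": "Shift_JIS"}   # exact label table
LOOSE = {uts22_key(k): v for k, v in KNOWN.items()}      # loose-match table

label = "EUC_JP"
print(LOOSE.get(uts22_key(label)))   # "EUC-JP": loose matching recognizes it
print(KNOWN.get(label.lower()))      # None: exact matching falls through to
                                     # the default/sniffed encoding instead

Sites that label themselves "EUC_JP" but depend on that fallback behavior break once the loose algorithm starts treating the label as EUC-JP.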


--
Anne van Kesteren
http://annevankesteren.nl/
