Re: Detection of unlabeled UTF-8

Neil Harris Fri, 06 Sep 2013 09:38:22 -0700

On 06/09/13 16:34, Gervase Markham wrote:


Data! Sounds like a plan.

Or we could ask our friends at Google or some other search engine to run
a version of our detector over their index and see how often it says
"UTF-8" when our normal algorithm would say something else.

Gerv

This website has an interesting, and apparently up-to-date, set of statistics:


http://w3techs.com/technologies/overview/character_encoding/all

Their current top ten encodings, as of today, are:

UTF-8: 76.7%
ISO-8859-1: 11.7%
Windows-1251 (Cyrillic): 2.9%
GB2312 (Chinese): 2.5%
Shift JIS (Japanese): 1.5%
Windows-1252 (superset of ISO-8859-1): 1.4%
GBK (Chinese): 0.7%
ISO-8859-2 (Eastern Europe, Latin script): 0.4%
EUC-JP (Japanese): 0.4%
Windows-1256 (Arabic): 0.4%

Although the exact interpretation of these results is tricky, since they
don't give their criteria for exactly how they define and detect these
encodings, if their results are even approximately right, it's pretty
clear that UTF-8 now dominates the web as the single commonest
charset/encoding by far.
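
For what it's worth, the kind of measurement Gerv describes could be
sketched roughly as below. This is only an illustration under stated
assumptions, not our actual detector: looks_like_utf8(), tally() and the
fallback_guess callback are hypothetical names, and "decodes as strict
UTF-8 and contains at least one non-ASCII byte" is just one plausible
detection rule.

```python
from collections import Counter


def looks_like_utf8(data: bytes) -> bool:
    """Assumed rule: non-ASCII bytes are present and the whole input
    decodes as strict UTF-8."""
    if data.isascii():
        # Pure ASCII decodes identically in most legacy encodings,
        # so it tells us nothing either way.
        return False
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def tally(pages, fallback_guess):
    """Count how often the UTF-8 check fires where the existing
    algorithm (hypothetical callback) would have said something else."""
    counts = Counter()
    for data in pages:
        counts["total"] += 1
        if looks_like_utf8(data) and fallback_guess(data).lower() != "utf-8":
            counts["utf8_overrides"] += 1
    return counts


# Tiny made-up sample, not real index data.
sample = [
    "café".encode("utf-8"),
    "café".encode("windows-1252"),
    b"plain ascii",
]
result = tally(sample, lambda data: "windows-1252")
print(result["utf8_overrides"], "of", result["total"])
```

Run over a real corpus of unlabeled pages, the ratio of the two counts
would be the number we are after.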


-- N.

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
