All the discussion of fallback character encodings has reminded me of an
issue I've been meaning to bring up for some time: as a user of the
en-US localization, the most common situation by far in which I see
mojibake these days is a site serving UTF-8-encoded pages without
declaring any encoding at all (neither via <meta charset> nor the
Content-Type header). It is possible to distinguish UTF-8 from most
legacy encodings heuristically with high reliability, and I'd like to
suggest that we do exactly that, independent of locale.
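To make the proposal concrete, here is a minimal sketch of the binary
check I have in mind (illustrative Python, not actual Gecko code):
accept a document as UTF-8 only if the byte stream validates as UTF-8
*and* actually uses multi-byte sequences. Pure ASCII can be left to the
fallback, since every candidate legacy encoding agrees with ASCII
anyway.

    def looks_like_utf8(data):
        """True if `data` is valid UTF-8 and contains at least one
        non-ASCII byte."""
        try:
            data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return False
        return any(b >= 0x80 for b in data)

What makes this reliable is the asymmetry of the failure modes: for
legacy single-byte text to be misdetected as UTF-8, its non-ASCII bytes
would have to form well-formed multi-byte sequences purely by accident,
which essentially never happens in real text, while a miss in the other
direction merely degrades to today's behavior.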
Having read through a bunch of the "fallback encoding is wrong" bugs
Henri's been filing, I have the impression that he would prefer we
*not* detect UTF-8, if only to limit the amount of 'magic' platform
behavior. I have three counterarguments to this:
1. There exist sites that still regularly add new, UTF-8-encoded
content, but whose *structure* was laid down in the late 1990s or early
2000s, declares no encoding, and is unlikely ever to be updated again.
The example I have to hand is
http://www.eyrie-productions.com/Forum/dcboard.cgi?az=read_count&om=138&forum=DCForumID24&viewmode=threaded
; many other posts on this forum have the same problem. Take note of the
vintage HTML. I suggested to the admins of this site that they add <meta
charset="utf-8"> to the master page template, and was told that no one
involved in current day-to-day operations has the necessary access
privileges. I suspect that this kind of situation is rather more common
than we would like to believe.
2. For some of the fallback-encoding-is-wrong bugs still open, a binary
UTF-8/unibyte heuristic would save the localization from having to
choose between displaying legacy minority-language content correctly and
displaying legacy hegemonic-language content correctly. If I understand
correctly, this is the case at least for Welsh:
https://bugzilla.mozilla.org/show_bug.cgi?id=844087 .
3. Files loaded from local disk have no encoding metadata from the
transport, and may have no in-band label either; in particular, UTF-8
plain text with no byte order mark, which is increasingly common, should
not be misidentified as the legacy encoding. Having a binary
UTF-8/unibyte heuristic might address some of the concerns mentioned in
the "File API should not use 'universal' character detection" bug,
https://bugzilla.mozilla.org/show_bug.cgi?id=848842 .
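For point 3 specifically, the decision order I'd imagine for unlabelled
local files would look something like this (again illustrative Python,
reusing looks_like_utf8 from above; not a description of the actual
File API code):

    import codecs

    def guess_local_file_encoding(data, fallback="windows-1252"):
        # Honor a byte order mark if one is present ...
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8"
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        # ... otherwise apply the binary UTF-8/unibyte check ...
        if looks_like_utf8(data):
            return "utf-8"
        # ... and only then reach for the locale's legacy fallback.
        return fallback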
If people are concerned about "infecting" the modern platform with
heuristics, perhaps we could limit application of the heuristic to
quirks-mode HTML delivered over HTTP. I expect this would cover the
majority of the sites described under point 1, and probably point 2 as
well.
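In code terms, the gate I'm proposing might look like this (hypothetical
names; an explicit label, whether from Content-Type or <meta charset>,
always wins, and the heuristic only kicks in for unlabelled quirks-mode
documents):

    def pick_encoding_for_http_html(data, declared_encoding, quirks_mode,
                                    locale_fallback):
        # Any explicit declaration takes precedence; never sniff then.
        if declared_encoding is not None:
            return declared_encoding
        # Only unlabelled quirks-mode documents get the heuristic.
        if quirks_mode and looks_like_utf8(data):
            return "utf-8"
        return locale_fallback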
zw