All the discussion of fallback character encodings has reminded me of an issue I've been meaning to bring up for some time: as a user of the en-US localization, by far the most common situation in which I see mojibake nowadays is a site putting UTF-8 in its pages without declaring any encoding at all (neither via <meta charset> nor an HTTP Content-Type header). It is possible to distinguish UTF-8 from most legacy encodings heuristically with high reliability, and I'd like to suggest that we ought to do so, independent of locale.
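
To be concrete about what "heuristically" means here: the core of the check is just strict UTF-8 validation, plus the requirement that at least one non-ASCII sequence actually appears. What follows is not Gecko code, just a minimal Python sketch of the idea (the function name is mine):

    def looks_like_utf8(data):
        """Heuristic: True iff `data` is well-formed UTF-8 and contains
        at least one non-ASCII byte.  Pure ASCII is no evidence either
        way, since it decodes identically under UTF-8 and under every
        legacy fallback we might otherwise choose."""
        if not any(b >= 0x80 for b in data):
            return False
        try:
            data.decode('utf-8', errors='strict')
        except UnicodeDecodeError:
            # A stray 0x80-0xBF byte, a lead byte without the right
            # continuation bytes, an overlong form, and so on -- text in
            # a legacy encoding almost always trips one of these checks
            # within a few bytes, which is why misidentification is rare.
            return False
        return True

A real implementation would presumably work incrementally over the incoming byte stream rather than on a complete buffer, but the decision criterion would be the same.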

Having read through a bunch of the "fallback encoding is wrong" bugs Henri's been filing, I have the impression that Henri would prefer we *not* detect UTF-8, if only to limit the amount of 'magic' platform behavior; however, I have three counterarguments to this:

1. There exist sites that still regularly add new, UTF-8-encoded content, but whose *structure* was laid down in the late 1990s or early 2000s, declares no encoding, and is unlikely ever to be updated again. The example I have to hand is http://www.eyrie-productions.com/Forum/dcboard.cgi?az=read_count&om=138&forum=DCForumID24&viewmode=threaded ; many other posts on this forum have the same problem. Take note of the vintage HTML. I suggested to the admins of this site that they add <meta charset="utf-8"> to the master page template, and was told that no one involved in current day-to-day operations has the necessary access privileges. I suspect that this kind of situation is rather more common than we would like to believe.

2. For some of the fallback-encoding-is-wrong bugs still open, a binary UTF-8/unibyte heuristic would save the localization from having to choose between displaying legacy minority-language content and legacy hegemonic-language content correctly. If I understand correctly, this is the case at least for Welsh: https://bugzilla.mozilla.org/show_bug.cgi?id=844087 .

3. Files loaded from local disk have no encoding metadata from the transport, and may have no in-band label either; in particular, UTF-8 plain text with no byte order mark, which is increasingly common, should not be misidentified as the locale's legacy encoding (a short demonstration follows this list). A binary UTF-8/unibyte heuristic might address some of the concerns raised in the "File API should not use 'universal' character detection" bug,
https://bugzilla.mozilla.org/show_bug.cgi?id=848842 .
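
To make points 2 and 3 concrete, here is how the looks_like_utf8() sketch above behaves on a few made-up inputs. Note that BOM-less UTF-8 is positively identified, while the same text in a legacy encoding, or pure ASCII, is not:

    sample = 'Ellipsis… and café'
    print(looks_like_utf8(sample.encode('utf-8')))         # True  -> decode as UTF-8
    print(looks_like_utf8(sample.encode('windows-1252')))  # False -> use the locale fallback
    print(looks_like_utf8(b'plain ASCII text'))            # False -> fallback is harmless here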

If people are concerned about "infecting" the modern platform with heuristics, perhaps we could limit the heuristic to quirks-mode HTML delivered over HTTP. I expect this would cover the majority of the sites described under point 1, and probably point 2 as well.
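
In rough Python pseudocode, the gate I have in mind would look something like the following; every name here is made up for illustration and does not correspond to an actual Gecko API:

    def may_apply_utf8_sniffing(doc):
        # Hypothetical gating condition for the proposal above: only
        # sniff quirks-mode HTML fetched over HTTP(S) that arrived
        # with no encoding declaration at all.
        return (doc.scheme in ('http', 'https')
                and doc.mode == 'quirks'
                and doc.declared_encoding is None)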

zw
