All the discussion of fallback character encodings has reminded me of an
issue I've been meaning to bring up for some time: as a user of the
en-US localization, the most common situation by far in which I see
mojibake these days is a site serving UTF-8-encoded pages without
declaring any encoding at all (neither via <meta charset> nor the
Content-Type header). It is possible to distinguish UTF-8 from most
legacy encodings heuristically with high reliability, and I'd like to
suggest that we do exactly that, independent of locale.
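To make the proposal concrete, here is a minimal sketch of the binary
check I have in mind (illustrative Python, not actual Gecko code):
accept a document as UTF-8 only if the byte stream validates as UTF-8
*and* actually uses multi-byte sequences. Pure ASCII can be left to the
fallback, since every candidate legacy encoding agrees with ASCII
anyway.

    def looks_like_utf8(data):
        """True if `data` is valid UTF-8 and contains at least one
        non-ASCII byte."""
        try:
            data.decode("utf-8", errors="strict")
        except UnicodeDecodeError:
            return False
        return any(b >= 0x80 for b in data)

What makes this reliable is the asymmetry of the failure modes: for
legacy single-byte text to be misdetected as UTF-8, its non-ASCII bytes
would have to form well-formed multi-byte sequences purely by accident,
which essentially never happens in real text, while a miss in the other
direction merely degrades to today's behavior.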
Having read through a bunch of the "fallback encoding is wrong" bugs
Henri's been filing, I have the impression that he would prefer we
*not* detect UTF-8, if only to limit the amount of 'magic' platform
behavior. I have three counterarguments to this:
1. There exist sites that still regularly add new, UTF-8-encoded
content, but whose *structure* was laid down in the late 1990s or early
2000s, declares no encoding, and is unlikely ever to be updated again.
The example I have to hand is
http://www.eyrie-productions.com/Forum/dcboard.cgi?az=read_count&om=138&forum=DCForumID24&viewmode=threaded
; many other posts on this forum have the same problem. Take note of the
vintage HTML. I suggested to the admins of this site that they add <meta
charset="utf-8"> to the master page template, and was told that no one
involved in current day-to-day operations has the necessary access
privileges. I suspect that this kind of situation is rather more common
than we would like to believe.
2. For some of the fallback-encoding-is-wrong bugs still open, a binary
UTF-8/unibyte heuristic would save the localization from having to
choose between displaying legacy minority-language content correctly and
displaying legacy hegemonic-language content correctly. If I understand
correctly, this is the case at least for Welsh:
https://bugzilla.mozilla.org/show_bug.cgi?id=844087 .
3. Files loaded from local disk have no encoding metadata from the
transport, and may have no in-band label either; in particular, UTF-8
plain text with no byte order mark, which is increasingly common, should
not be misidentified as the legacy encoding. Having a binary
UTF-8/unibyte heuristic might address some of the concerns mentioned in
the "File API should not use 'universal' character detection" bug,
https://bugzilla.mozilla.org/show_bug.cgi?id=848842 .
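For point 3 specifically, the decision order I'd imagine for unlabelled
local files would look something like this (again illustrative Python,
reusing looks_like_utf8 from above; not a description of the actual
File API code):

    import codecs

    def guess_local_file_encoding(data, fallback="windows-1252"):
        # Honor a byte order mark if one is present ...
        if data.startswith(codecs.BOM_UTF8):
            return "utf-8"
        if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return "utf-16"
        # ... otherwise apply the binary UTF-8/unibyte check ...
        if looks_like_utf8(data):
            return "utf-8"
        # ... and only then reach for the locale's legacy fallback.
        return fallback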
If people are concerned about "infecting" the modern platform with
heuristics, perhaps we could limit application of the heuristic to
quirks-mode HTML delivered over HTTP. I expect this would cover the
majority of the sites described under point 1, and probably point 2 as
well.
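In code terms, the gate I'm proposing might look like this (hypothetical
names; an explicit label, whether from Content-Type or <meta charset>,
always wins, and the heuristic only kicks in for unlabelled quirks-mode
documents):

    def pick_encoding_for_http_html(data, declared_encoding, quirks_mode,
                                    locale_fallback):
        # Any explicit declaration takes precedence; never sniff then.
        if declared_encoding is not None:
            return declared_encoding
        # Only unlabelled quirks-mode documents get the heuristic.
        if quirks_mode and looks_like_utf8(data):
            return "utf-8"
        return locale_fallback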
zw