# Summary

The template says this section should state the benefit to Web
developers. There is intentionally no benefit to Web developers. This
pair of features is meant to benefit users who encounter
badly-authored legacy pages, so that Firefox can retain users instead
of the users trying such pages in Chrome or the new Edge. That is,
this is about
the user experience of browsing the legacy long tail of the Web. This
is not about cool new stuff. In that sense, this feature is out of the
scope of "Intent to prototype" emails, but I'm sending one, because
this is a Web-visible feature in the sense that Web content could
detect its presence.

For newly-authored HTML pages, Web developers should use UTF-8 and
declare it (via UTF-8 BOM, <meta charset=utf-8>, or HTTP Content-Type:
text/html; charset=utf-8). The first and last options apply to
text/plain, too.

With that out of the way, there are two features contemplated here:

1. On _unlabeled_ text/html and text/plain pages, autodetect the
_legacy_ encoding (excluding UTF-8) for non-file: URLs, and autodetect
the encoding (including UTF-8) for file: URLs.

Elevator pitch: Chrome already did this unilaterally. The motivation
is to avoid a situation where a user switches to a Chromium-based
browser as a result of browsing the legacy Web or local files.

As in Chrome, UTF-8 is deliberately excluded from possible detection
outcomes on non-file: URLs in order to avoid creating a situation
where the feature would have an unwanted effect on future Web
development by causing Web developers to rely on UTF-8 detection,
which would make the platform more brittle. That is, one type of
user-facing problem is deliberately left unfixed in order to avoid a
feedback loop into authoring that would generate more of the problem.
However, feature #2 below continues to allow users to address this
problem at the cost of taking an explicit menu action.
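For concreteness, here's a minimal sketch of what the detection call
could look like, assuming a detector API shaped like the chardetng
crate's EncodingDetector (feed the bytes, then ask for a guess given a
TLD hint and an allow-UTF-8 flag). The function name and its arguments
are illustrative, not the actual Gecko integration:

```rust
use chardetng::EncodingDetector;
use encoding_rs::Encoding;

/// Sketch of feature #1. `tld` is the top-level domain of the page's
/// origin, used as a detection hint (None for file: URLs, which have
/// no TLD).
fn guess_unlabeled(
    bytes: &[u8],
    is_file_url: bool,
    tld: Option<&[u8]>,
) -> &'static Encoding {
    let mut detector = EncodingDetector::new();
    detector.feed(bytes, true); // `true`: this is all the input there is
    // UTF-8 is a permitted outcome only for file: URLs, so that Web
    // content can never come to rely on UTF-8 being detected.
    detector.guess(tld, is_file_url)
}
```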

(Full discussion of the implications of detecting UTF-8 for HTML on
non-file: URLs needs a blog post, which I intend to write but which is
out of scope for this summary. Detecting UTF-8 for text/plain would be
less problematic, since there are no scripts or stylesheets that the
encoding would get inherited into and a reload wouldn't re-run any
script side effects, so I'm willing to entertain the idea of detecting
UTF-8 on non-file: text/plain, but it seems like a slippery slope.)

(Why now? Edge switching from the "like Safari" camp to the "like
Chrome" camp made it seem substantially less likely that everyone
would agree to get rid of guessing, so it no longer makes sense to
push for that outcome for the Web Platform. Also, now that UTF-8 has
clearly won for new Web development, this feature is likely to be less
harmful than it could have been in the past.)

2. Replace the Text Encoding submenu with a single menu item, Override
Text Encoding, which forces the detector to run in a mode that ignores
the TLD hint and allows UTF-8 as an outcome.

(Disabled in the situations where the menu is presently disabled and
not taking effect in the situations where the menu presently does not
take effect. The menu is presently disabled if the top-level page is
in UTF-8 and valid, the top-level page started with a BOM, the
top-level page is UTF-16[BE|LE], or the top-level page is neither
text/html nor text/plain. The menu presently doesn't take effect if
the type of the page is neither text/html nor text/plain, the HTTP
layer declared UTF-16[BE|LE], or the stream starts with a BOM. [As you can
see, the latter list is a subset of the former, so it should be
possible for the latter list to matter only for framed documents.])
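To make those condition lists and the subset claim concrete, here's a
sketch of them as predicates; the struct and field names are made up
for illustration and are not Gecko's actual internals:

```rust
/// Illustrative per-document charset state for this sketch only.
struct CharsetState {
    is_html_or_plain_text: bool,
    starts_with_bom: bool,
    http_declared_utf16: bool, // UTF-16[BE|LE] from the HTTP layer
    is_utf16: bool,            // UTF-16[BE|LE] from any source
    valid_declared_utf8: bool, // declared UTF-8 and no decode errors
}

/// The conditions under which the override actually changes anything
/// for a given (possibly framed) document.
fn override_takes_effect(doc: &CharsetState) -> bool {
    doc.is_html_or_plain_text
        && !doc.http_declared_utf16
        && !doc.starts_with_bom
}

/// The conditions under which the menu item is enabled, judged from
/// the top-level document only. Because `is_utf16` covers at least
/// `http_declared_utf16`, `menu_enabled` implies `override_takes_effect`
/// for the top-level document, which is why the latter can matter only
/// for framed documents.
fn menu_enabled(top: &CharsetState) -> bool {
    top.is_html_or_plain_text
        && !top.starts_with_bom
        && !top.is_utf16
        && !top.valid_declared_utf8
}
```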

Elevator pitch: Telemetry shows a) a substantial proportion of menu
use is for overriding _labeled_ pages and b) a substantial proportion
of menu use is to override an already-overridden encoding, suggesting
that users are bad at making a choice from the menu. Retaining a
user-invocable override continues to address the issue of mislabeled
content (which is presently addressed by Firefox and by desktop Safari
by providing the menu) while eliminating the need for the user to
figure out what to choose.

(Basically, feature #2 is easy to provide once feature #1 exists.)
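Under the same assumed detector API as in the sketch for feature #1,
the override is just a different pair of arguments to the same
machinery:

```rust
use chardetng::EncodingDetector;
use encoding_rs::Encoding;

/// Sketch of feature #2: the user-invoked override re-runs detection
/// with the TLD hint ignored and UTF-8 allowed as an outcome.
fn guess_override(bytes: &[u8]) -> &'static Encoding {
    let mut detector = EncodingDetector::new();
    detector.feed(bytes, true);
    detector.guess(None, true) // no TLD hint; UTF-8 permitted
}
```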

# Bug

https://bugzilla.mozilla.org/show_bug.cgi?id=1551276

# Standard

The HTML Standard authorizes the existence of this kind of component
without specifying exactly how it should work.

Beyond that, there is no standard, but the implementation developed
here has deliberately been created in such a way that contributing the
data tables to the WHATWG and reversing the code into spec English
would be _possible_ if there's cross-vendor interest. In contrast, the
code in Chromium is a non-Chromium-originating, over-the-wall dump of
mystery C++ (lacking public design documentation as well as tooling
for regenerating the generated parts) that even the Chrome developers
can't/won't change beyond making it compile with newer compilers.

(Furthermore, my implementation relies on the browser
already containing an implementation of the Encoding Standard. This
cuts the binary size impact to less than one fourth compared to
adopting the detector from Chrome, which doesn't benefit from any data
tables that a browser already has to have anyway.)

I've gone with demonstrating feasibility before further cross-vendor
discussion, because this is a user retention measure in response to a
unilateral move on Chrome's part, and Safari on iOS doesn't face
pressure from users switching to browsers with a different Web engine.

# Platform coverage

All platforms.

# Preference

There will probably be one for an initial testing period, but I
haven't picked a name yet.

# DevTools bug

There is no new DevTools surface for this. The HTML parser already
complains in a DevTools-visible way about unlabeled pages, and this
change will not remove those messages.

# Other browsers

Chromium-based browsers: Already shipping feature #1 (not shipping feature #2)

IE: Ships an off-by-default detector (not precisely feature #1 or #2
but a kind of combination of the two).

Safari: Not shipping either feature but, like Firefox and unlike
Chrome, provides a menu for addressing the use cases that feature #2
is meant to address.

# web-platform-tests

Since there isn't a spec and Safari doesn't implement the feature,
there are no cross-vendor tests.

# Secure contexts

Since this pair of features is about compatibility with legacy
content, both features apply to insecure contexts.

# Sandboxed iframes

Both features apply to sandboxed iframes.

For feature #1, the feature applies only to different-origin frames
and the situation is the same as for the pre-existing Japanese
detection: The framer cannot turn off the feature for the framee.
Either the framer or the framee can turn off the feature for itself by
adhering to the HTML authoring conformance requirements, i.e. by
declaring its own encoding.

For feature #2, the situation is the same as for the pre-existing
menu: The top-level page can turn off the feature for the whole
hierarchy by using UTF-8, not having any UTF-8 errors, and declaring
UTF-8, or, alternatively, by using the UTF-8 BOM (even if there are
subsequent errors). The framee can turn off the feature for itself by
using the UTF-8 BOM.

--
Henri Sivonen
hsivo...@mozilla.com