On Mon, Dec 2, 2019 at 2:42 PM Henri Sivonen <hsivo...@mozilla.com> wrote:
> 1. On _unlabeled_ text/html and text/plain pages, autodetect _legacy_
> encoding, excluding UTF-8, for non-file: URLs and autodetect the
> encoding, including UTF-8, for file: URLs.
>
> Elevator pitch: Chrome already did this unilaterally. The motivation
> is to avoid a situation where a user switches to a Chromium-based
> browser as a result of browsing the legacy Web or local files.

Feature #1 is now on autoland.

> # Preference

For file: URLs, I ended up not putting the new detector behind a pref,
because the file: detection code is messy enough even without
alternative code paths, and I'm pretty confident that the new detector
is an improvement for our file: URL handling behavior.

For non-file: URLs, the new detector is overall controlled by
intl.charset.detector.ng.enabled, which defaults to true, i.e. the
detector is enabled. When the detector is enabled, various old
intl.charset.* prefs are ignored in various ways.

The detector is, however, disabled by default for three TLDs: .jp,
.in, and .lk. This can be overridden via the prefs
intl.charset.detector.ng.jp.enabled,
intl.charset.detector.ng.in.enabled, and
intl.charset.detector.ng.lk.enabled all three of which default to
false. (These prefs cannot enable the detector if
intl.charset.detector.ng.enabled is false.)
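To make the pref interaction concrete, here is a minimal Python sketch
of the gating as described above. This is illustrative only, not
actual Gecko code: the function name and the pref-dictionary shape are
mine, though the pref names and defaults are the real ones.

```python
# Real per-TLD prefs; each defaults to false.
TLD_PREFS = {
    "jp": "intl.charset.detector.ng.jp.enabled",
    "in": "intl.charset.detector.ng.in.enabled",
    "lk": "intl.charset.detector.ng.lk.enabled",
}

def detector_enabled(prefs, tld):
    """Return True if the new detector runs for a non-file: page on
    the given TLD. `prefs` maps pref names to boolean values; unset
    prefs take their default."""
    # The master pref gates everything: the per-TLD prefs cannot
    # enable the detector when it is false.
    if not prefs.get("intl.charset.detector.ng.enabled", True):
        return False
    # .jp, .in, and .lk are additionally off by default.
    tld_pref = TLD_PREFS.get(tld)
    if tld_pref is not None:
        return prefs.get(tld_pref, False)
    return True
```

So with default prefs, the detector runs on .com but not on .jp, and
flipping intl.charset.detector.ng.jp.enabled to true has no effect
unless the master pref is also true.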

In the case of .jp, the pre-existing Japanese-specific detector is
used. This avoids regressing how soon we start reloading if we detect
EUC-JP.

The detector detects encodings that are actually part of the Web
Platform. However, this can cause problems when a site expects the
page to be decoded as windows-1252 _as a matter of undeclared
fallback_ and expects the user to have an _intentionally mis-encoded_
font that assigns non-Latin glyphs to the windows-1252 code points.
(Note that if the site says <meta charset=x-user-defined>, that
continues to be undisturbed:
https://searchfox.org/mozilla-central/rev/62a130ba0ac80f75175e4b65536290b52391f116/parser/html/nsHtml5StreamParser.cpp#1512
)

Chrome has detection for three windows-1252-misusing Devanagari font
encodings and nine Tamil ones. (Nine looks like a lot, but a Python
tool in this space is documented to handle 25 Tamil legacy encodings!)
There is no indication that the Chrome developers found it necessary
to add these detections for the Web; rather, it looks like Chrome
inherited them from Google search engine code. Actively-maintained
newspaper sites that, according to old Bugzilla items, previously used
these font hacks have migrated to Unicode. Still, this leaves the
possibility that
there are sites that presently work (if the user has the appropriate
fonts installed) in Chrome thanks to this detection and in Firefox
thanks to Firefox mapping the .in TLD to windows-1252 and mapping .com
to windows-1252 in the English localizations as well as in the
localizations for the Brahmic-script languages of India.

Not enabling the new detector on .in, at least for now, avoids
disrupting sites that intentionally misuse windows-1252 without
declaring it, if such sites are still used by users (at the expense of
out-of-locale usage of .in as a generic TLD; data disclosed by Google
as part of Chrome's detector suggests e.g. Japanese use of .in). To the
extent the phenomenon of relying on intentionally misencoded fonts
still exists but on .com, the new detector will likely disrupt it
(likely by guessing some Cyrillic encoding). However, I think it
doesn't make sense to let that possibility derail this whole
project/feature.

Although I believe this phenomenon to be mostly a Tamil-in-Tamil-Nadu
thing rather than a general Tamil-language thing, I disabled the
detector on .lk just in case, to have more time to research the issue.

If reports of legacy Tamil sites breaking show up, please needinfo me
on Bugzilla.

I didn't disable the detector for .am, because Chrome doesn't appear
to have detections for Armenian intentional misuse of windows-1252.

If intl.charset.detector.ng.enabled is false, Japanese detection
behaves like previously, except that encoding inheritance from a
same-origin parent frame now takes precedence over the detector. (This
was a spec compliance bug that had previously gone unnoticed because
we hadn't run the full test suite with a detector enabled. It turns
out that tests both semi-intentionally and accidentally depend on
same-origin inheritance taking precedence as the spec says.)
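The ordering change can be sketched like this (illustrative Python,
not the real stream parser; all names here are mine):

```python
def pick_encoding(declared, same_origin_parent_encoding, run_detector):
    """Illustrative precedence for choosing the encoding of a
    text/html page (not actual Gecko code). `run_detector` is a
    callable standing in for the detector."""
    if declared is not None:
        # A declared encoding always wins; detection never runs.
        return declared
    # Spec-compliant order: encoding inheritance from a same-origin
    # parent frame now takes precedence over the detector.
    if same_origin_parent_encoding is not None:
        return same_origin_parent_encoding
    # Only an unlabeled page with no same-origin parent falls
    # through to detection.
    return run_detector()
```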

In the interest of binary size, I removed the old Cyrillic detector at
the same time as landing the new one. If the new detector is disabled
but the old Cyrillic detector is enabled, the new detector runs in the
situations where the old Cyrillic detector would have run, in a mode
that approximates it. (This approximation can,
however, result in some non-Cyrillic outcomes that were impossible
with the old Cyrillic detector.)

> # web-platform-tests

I added tests as tentative WPTs.

-- 
Henri Sivonen
hsivo...@mozilla.com
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
