On Fri, Aug 30, 2013 at 6:17 PM, Adam Roach <a...@mozilla.com> wrote:
>
> It seems to me that there's an important balance here between (a) letting 
> developers discover their configuration error and (b) allowing users to 
> render misconfigured content without specialized knowledge.

It's worth noting that for other classes of authoring errors (except
for errors in https deployment) we don't give the user the tools to
remedy them.

> Both of these are valid concerns, and I'm afraid that we're not assigning 
> enough weight to the user perspective.

Assigning weight to the *short-term* user perspective seems to be what
got us into this mess in the first place. If Netscape had never had a
manual override for the character encoding, or locale-specific
differences in the default encoding, user-exposed brokenness would
have quickly taught authors to get their act together on
encodings--especially in the context of languages like Japanese, where
a wrong encoding guess makes the page completely unreadable.

(The obvious counter-argument is that in the case of languages that
use a non-Latin script, getting the encoding wrong is near the YSoD
(the "Yellow Screen of Death" shown for XML parse errors) level of
disaster, and it's agreed that XML's error handling was a mistake
compared to HTML's. However, HTML's error handling surfaces no UI
choices to the user, works without having to reload the page, and is
now well specified. Furthermore, even in the case of HTML, hindsight
says we'd be better off if no browser had tried to be too helpful
about fixing misnested markup like <i><b></i><b> in the first place.)

> I think we can find some middle ground here, where we help developers 
> discover their misconfiguration, while also handing users the tool they need 
> to fix it. Maybe an unobtrusive bar (similar to the password save bar) that 
> says something like: "This page's character encoding appears to be 
> mislabeled, which might cause certain characters to display incorrectly. 
> Would you like to reload this page as Unicode? [Yes] [No] [More Information] 
> [x]".

Given how rare this class of authoring error is, why should we surface
it in the UI in a way that asks the user to make a decision? Are there
other classes of authoring errors that you think should have UI for
the user to second-guess the author? If yes, why? If not, why not?

That is, why is the case where text/html is in fact valid UTF-8 and
contains non-ASCII characters but has not been declared as UTF-8 so
special, compared to other possible authoring errors, that it merits
its own treatment?
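
To make this concrete, the case in question is roughly the following
check (a sketch in Python with a made-up function name; not how any
actual browser structures this):

    # Sketch of the authoring error under discussion: bytes that decode
    # as UTF-8 and contain non-ASCII, but arrive without a declaration.
    def looks_like_undeclared_utf8(body, declared_encoding):
        if declared_encoding is not None:
            return False  # the author declared something; different case
        try:
            text = body.decode("utf-8")
        except UnicodeDecodeError:
            return False  # not valid UTF-8 at all
        # ASCII-only bytes decode identically under the legacy fallback
        # encodings, so only non-ASCII content makes the missing label
        # visible to the user.
        return any(ord(ch) > 0x7F for ch in text)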

On Fri, Aug 30, 2013 at 8:24 PM, Mike Hoye <mh...@mozilla.com> wrote:
> For what it's worth Internet Explorer handled this (before UTF-8 and caring
> about JS performance were a thing) by guessing what encoding to use,
> comparing a letter-frequency analysis of a page's content to a table of which
> bytes are most common in which encodings of which languages.

Is there evidence of IE doing this in locales other than Japanese,
Russian and Ukrainian? Or even in locales other than Japanese? Firefox
does this only for the Japanese, Russian and Ukrainian locales.

(FWIW, whether this is still needed for the Russian and Ukrainian
locales is being studied in
https://bugzilla.mozilla.org/show_bug.cgi?id=845791 . As for
Japanese, some sort of detection magic is probably staying for the
foreseeable future. It appears that Microsoft fairly recently tried to
take ISO-2022-JP out of their detector for security reasons but had to
put it back for compatibility: http://support.microsoft.com/kb/2416400
http://support.microsoft.com/kb/2482017 )
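
For concreteness, the byte-frequency approach Mike describes might
look roughly like the toy sketch below. The frequency table and names
are invented for illustration; real detectors are trained on large
corpora and typically model byte sequences, not single bytes:

    # Toy frequency-based encoding guesser. EXPECTED_FREQ is a stand-in
    # for a real table derived from per-language corpus statistics.
    EXPECTED_FREQ = {
        "windows-1251": {0xEE: 0.09, 0xE5: 0.07, 0xE0: 0.06},  # о, е, а
        "koi8-r":       {0xCF: 0.09, 0xC5: 0.07, 0xC1: 0.06},  # о, е, а
    }

    def score(body, table):
        counts = {}
        high = 0
        for b in body:
            if b >= 0x80:
                counts[b] = counts.get(b, 0) + 1
                high += 1
        if high == 0:
            return 0.0  # ASCII-only input: nothing to go on
        # Reward encodings whose expectedly common bytes are common here.
        return sum(min(counts.get(b, 0) / high, f) for b, f in table.items())

    def guess(body):
        return max(EXPECTED_FREQ, key=lambda e: score(body, EXPECTED_FREQ[e]))

Even this toy shows the failure mode: short or atypical pages simply
don't have enough high bytes to score reliably.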

> It's
> probably not a suitable approach in modernity, because of performance
> problems and horrible-though-rare edge cases.

See point #3 in https://bugzilla.mozilla.org/show_bug.cgi?id=910211#c2

On Fri, Aug 30, 2013 at 9:33 PM, Joshua Cranmer 🐧 <pidgeo...@gmail.com> wrote:
> The problem I have with this approach is that it assumes that the page is
> authored by someone who definitively knows the charset, which is not a
> scenario that universally holds. Suppose you have a page that serves up the
> contents of a plain text file, so your source data has no indication of its
> charset. What charset should the page report?

Your scenario assumes that the page template is ASCII-only. If it
isn't, browser-side guessing doesn't solve the problem. Even when the
template is ASCII-only, whoever wrote the inclusion on the server
probably has better contextual knowledge about what the encoding of
the input text could be than the browser has.
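
That is, the fix belongs on the server side, where that contextual
knowledge lives. A minimal sketch (the candidate-encoding list is a
placeholder; pick it from whatever is known about the input's origin):

    import html

    # Server-side sketch: use contextual knowledge to pick a decoding
    # for the included text, then transcode so the whole page can
    # honestly be declared as UTF-8.
    def include_text_file(path, likely=("utf-8", "windows-1251")):
        raw = open(path, "rb").read()
        for enc in likely:
            try:
                text = raw.decode(enc)
                break
            except UnicodeDecodeError:
                continue
        else:
            text = raw.decode("latin-1")  # cannot fail: 1:1 byte mapping
        # Serve inside a template that is itself UTF-8 and declared as
        # such, e.g. Content-Type: text/html; charset=utf-8
        return "<pre>%s</pre>" % html.escape(text)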

-- 
Henri Sivonen
hsivo...@hsivonen.fi
http://hsivonen.iki.fi/