Re: Detection of unlabeled UTF-8

Adam Roach Fri, 30 Aug 2013 12:20:45 -0700

On 8/30/13 12:24, Mike Hoye wrote:

On 2013-08-30 11:17 AM, Adam Roach wrote:
It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge.
For what it's worth, Internet Explorer handled this (before UTF-8 and caring about JS performance were a thing) by guessing what encoding to use, comparing a letter-frequency analysis of a page's content to a table of which bytes are most common in which encodings of whatever languages.
...
From both the developer and user perspectives, it amounted to "something went wrong because of bad magic."


I'd like to clarify two points about what I'm proposing.

First, I'm not proposing that we do anything without explicit user intervention, other than present an unobtrusive bar helping the user understand why the headline they're trying to read renders as "Ð'Ð"Ð¾Ñ?Ð´ÑfÐ¼Ðµ Ð¿ÑEURÐµÐ´Ð»Ð¾Ð¶Ð¸Ð»Ð¸ Ð¾Ñ,Ð¾Ð±ÑEURÐ°Ñ,ÑOE "Ð?Ð¾Ð±ÐµÐ»Ñ?"Ñf Ðz(Ð±Ð°Ð¼Ñ< " rather than "В Госдуме предложили отобрать "Нобеля" у Обамы". (No political statement intended here -- that's just the leading headline on Pravda at the moment.)

If the user is happy with the encoding, they do nothing and go about their business.

If the user determines that the rendering is, in fact, not what they want, they can simply click on the "Yes" button and (with high probability) everything is right with the world again.

Also note that I'm not proposing that we try to do generic character set and language detection. That's fraught with the perils you cite. The topic we're discussing here is UTF-8, which can be easily detected with extremely high confidence.
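
To illustrate why the UTF-8 check is so reliable, here is a minimal sketch in Python. It is only an illustration, not what any browser actually ships: the function name looks_like_utf8 is made up for this example, and it assumes the whole byte stream is available at once. The idea is that bytes produced by a legacy single-byte encoding almost never happen to form well-formed UTF-8 multi-byte sequences, so a strict decode that succeeds on non-ASCII input is strong evidence.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is well-formed UTF-8 with non-ASCII content."""
    # Pure ASCII is identical in UTF-8 and most legacy encodings,
    # so it gives no evidence either way.
    if not any(b >= 0x80 for b in data):
        return False
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    headline = "Обамы"
    print(looks_like_utf8(headline.encode("utf-8")))   # UTF-8 bytes
    print(looks_like_utf8(headline.encode("cp1251")))  # windows-1251 bytes
```

A real implementation would presumably look at only a prefix of the document and would have to cope with a multi-byte sequence cut off at the end of that prefix; this sketch ignores both concerns.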


--
Adam Roach
Principal Platform Engineer
a...@mozilla.com
+1 650 903 0800 x863
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
