Re: Detection of unlabeled UTF-8

2013-09-11 Thread Jean-Marc Desperrier
Adam Roach wrote: when you look at that document, tell me what you think the parenthetical phrase after the author's name is supposed to look like -- because I can guarantee that Firefox isn't doing the right thing here. In my case it does and displays: Хизер Фланаган I have the universal c

Re: Detection of unlabeled UTF-8

2013-09-10 Thread Neil
And then you get sites that send ISO-8859-1 but the server is configured to send UTF-8 in the headers, e.g. http://darwinawards.com/darwin/darwin1999-38.html -- Warning: May contain traces of nuts.

Re: Detection of unlabeled UTF-8

2013-09-09 Thread Adam Roach
On 9/9/13 02:31, Henri Sivonen wrote: We don't have telemetry for the question "How often are pages that are not labeled as UTF-8, UTF-16 or anything that maps to their replacement encoding according to the Encoding Standard and that contain non-ASCII bytes in fact valid UTF-8?" How rare would th

Re: Detection of unlabeled UTF-8

2013-09-09 Thread Henri Sivonen
On Fri, Sep 6, 2013 at 6:17 PM, Adam Roach wrote: > Sure. It's a much trickier problem (and, in any case, the UI is > necessarily more intrusive than what I'm suggesting). There's no good way > to explain the nuanced implications of security decisions in a way that is > both accessible to a lay u

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Neil Harris
On 06/09/13 18:28, Boris Zbarsky wrote: On 9/6/13 1:11 PM, Neil Harris wrote: Presumably most of that XHTML is being generated by automated tools Presumably most of that "XHTML" are tag-soup pages which claim to have an XHTML doctype. The chance of them actually being valid XHTML is slim to

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Marcos Caceres
On Friday, September 6, 2013 at 5:36 PM, Neil Harris wrote: > On 06/09/13 16:34, Gervase Markham wrote: > > > > Data! Sounds like a plan. > > > > Or we could ask our friends at Google or some other search engine to run > > a version of our detector over their index and see how often it says >

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Neil Harris
On 06/09/13 17:48, Marcos Caceres wrote: On Friday, September 6, 2013 at 5:36 PM, Neil Harris wrote: On 06/09/13 16:34, Gervase Markham wrote: Data! Sounds like a plan. Or we could ask our friends at Google or some other search engine to run a version of our detector over their index and se

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Robert Kaiser
Henri Sivonen wrote: Considering what Aryeh said earlier in this thread, do you have a suggestion how to do that so that > [...] Hmm, do we have to treat the whole document as a consistent charset? Could we instead, if we don't know the charset, look at every rendered-as-text node/attribute

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Boris Zbarsky
On 9/6/13 1:11 PM, Neil Harris wrote: Presumably most of that XHTML is being generated by automated tools Presumably most of that "XHTML" are tag-soup pages which claim to have an XHTML doctype. The chance of them actually being valid XHTML is slim to none (though maybe higher than the chanc

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Neil Harris
On 06/09/13 16:34, Gervase Markham wrote: Data! Sounds like a plan. Or we could ask our friends at Google or some other search engine to run a version of our detector over their index and see how often it says "UTF-8" when our normal algorithm would say something else. Gerv This website has an

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Neil Harris
On 06/09/13 16:45, Robert Kaiser wrote: Henri Sivonen wrote: Considering what Aryeh said earlier in this thread, do you have a suggestion how to do that so that > [...] Hmm, do we have to treat the whole document as a consistent charset? Could we instead, if we don't know the charset, look

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Adam Roach
On 9/6/13 04:25, Henri Sivonen wrote: We do surface such UI for https deployment errors inspiring academic papers about how bad it is that users are exposed to such UI. Sure. It's a much trickier problem (and, in any case, the UI is necessarily more intrusive than what I'm suggesting). There'

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Gervase Markham
On 06/09/13 16:17, Adam Roach wrote: > To the first point: the increase in complexity is fairly minimal for a > substantial gain in usability. Absent hard statistics, I suspect we will > disagree about how "fringe" this particular exception is. Suffice it to > say that I have personally encountered

Re: Detection of unlabeled UTF-8

2013-09-06 Thread Henri Sivonen
On Thu, Sep 5, 2013 at 7:32 PM, Mike Hoye wrote: > On 2013-09-05 10:10 AM, Henri Sivonen wrote: >> >> It's worth noting that for other classes of authoring errors (except for >> errors in https deployment) we don't give the user the tools to remedy >> authoring errors. > > Firefox silently remedie

Re: Detection of unlabeled UTF-8

2013-09-05 Thread Boris Zbarsky
On 9/5/13 11:15 AM, Adam Roach wrote: I would argue that we do, to some degree, already do this for things like Content-Encoding. For example, if a website attempts to send gzip-encoded bodies without a Content-Encoding header, we don't simply display the compressed body as if it were encoded acc
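As an illustration of the Content-Encoding comparison quoted above, here is a minimal sketch of content sniffing for gzip. It only shows the general idea of checking the gzip magic number (the bytes 0x1f 0x8b) and is not a description of what Firefox actually does; the function names are hypothetical.

```python
import gzip
import zlib

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def body_looks_gzipped(body: bytes) -> bool:
    """Cheap sniff: does the body start with the gzip magic number?"""
    return body[:2] == GZIP_MAGIC


def recover_body(body: bytes) -> bytes:
    """Decompress a body that looks gzipped; otherwise return it unchanged.

    Hypothetical helper for illustration only. If the magic number matches
    but the data is not really gzip, fall back to the raw bytes.
    """
    if not body_looks_gzipped(body):
        return body
    try:
        return gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        return body
```

The fallback branch matters because two matching bytes are weak evidence on their own, which is the same trade-off the thread discusses for encoding heuristics.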

Re: Detection of unlabeled UTF-8

2013-09-05 Thread Robert Kaiser
Zack Weinberg wrote: It is possible to distinguish UTF-8 from most legacy encodings heuristically with high reliability, and I'd like to suggest that we ought to do so, independent of locale. I would very much agree with doing that. UTF-8 is what is being suggested everywhere as the encoding
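For readers wondering what such a heuristic might look like, here is a minimal sketch, assuming the whole body is available as bytes. It is not the detector proposed in the thread; the function name is hypothetical. The idea is that text in a legacy single-byte encoding rarely happens to form valid UTF-8 multi-byte sequences, so "contains non-ASCII bytes and decodes strictly as UTF-8" is strong evidence.

```python
def looks_like_unlabeled_utf8(body: bytes) -> bool:
    """Return True if body contains non-ASCII bytes and is valid UTF-8.

    Pure-ASCII input returns False: it decodes identically under UTF-8
    and most legacy encodings, so it carries no evidence either way.
    """
    if body.isascii():
        return False
    try:
        body.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_unlabeled_utf8("Хизер Фланаган".encode("utf-8")))
print(looks_like_unlabeled_utf8("café".encode("latin-1")))
```

False positives remain possible: the ISO-8859-1 bytes for the two characters "Ã©" are also the valid UTF-8 encoding of "é", which is why this is a heuristic and not a proof.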

Re: Detection of unlabeled UTF-8

2013-09-05 Thread Mike Hoye
On 2013-09-05 10:10 AM, Henri Sivonen wrote: It's worth noting that for other classes of authoring errors (except for errors in https deployment) we don't give the user the tools to remedy authoring errors. Firefox silently remedies all kinds of authoring errors. - mhoye

Re: Detection of unlabeled UTF-8

2013-09-05 Thread Adam Roach
On 9/5/13 09:10, Henri Sivonen wrote: Why should we surface this class of authoring error to the UI in a way that asks the user to make a decision considering how rare this class of authoring error is? It's not a matter of the user judging the rarity of the condition; it's the user being abl

Re: Detection of unlabeled UTF-8

2013-09-05 Thread Henri Sivonen
On Fri, Aug 30, 2013 at 6:17 PM, Adam Roach wrote: > > It seems to me that there's an important balance here between (a) letting > developers discover their configuration error and (b) allowing users to > render misconfigured content without specialized knowledge. It's worth noting that for oth

Re: Detection of unlabeled UTF-8

2013-09-04 Thread Adam Roach
On 9/2/13 13:36, Joshua Cranmer 🐧 wrote: I don't think there *is* a sane approach that satisfies everybody. Either you break "UTF8-just-works-everywhere", you break legacy content, you make parsing take inordinate times... I want to push on this last point a bit. Using a straightforward UTF-8

Re: Detection of unlabeled UTF-8

2013-09-02 Thread Joshua Cranmer 🐧
On 8/30/2013 1:41 PM, Anne van Kesteren wrote: On Fri, Aug 30, 2013 at 7:33 PM, Joshua Cranmer 🐧 wrote: The problem I have with this approach is that it assumes that the page is authored by someone who definitively knows the charset, which is not a scenario which universally holds. Suppose you

Re: Detection of unlabeled UTF-8

2013-09-02 Thread Anne van Kesteren
On Fri, Aug 30, 2013 at 8:36 PM, Adam Roach wrote: > On 8/30/13 13:41, Anne van Kesteren wrote: >> Where did the text file come from? There's a source somewhere... And >> these days that's hardly how people create content anyway. > > Maybe not for the content _you_ consume, but the Internet is a b

Re: Detection of unlabeled UTF-8

2013-08-31 Thread Neil
Mike Hoye wrote: On 2013-08-30 3:17 PM, Adam Roach wrote: On 8/30/13 14:11, Adam Roach wrote: ...helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ?

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Adam Roach
On 8/30/13 13:41, Anne van Kesteren wrote: Where did the text file come from? There's a source somewhere... And these days that's hardly how people create content anyway. Maybe not for the content _you_ consume, but the Internet is a bit larger than our ivory tower. Check out, for example:

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Mike Hoye
On 2013-08-30 3:17 PM, Adam Roach wrote: On 8/30/13 14:11, Adam Roach wrote: ...helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? ???

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Adam Roach
On 8/30/13 12:24, Mike Hoye wrote: On 2013-08-30 11:17 AM, Adam Roach wrote: It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge. For what

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Adam Roach
On 8/30/13 14:11, Adam Roach wrote: ...helping the user understand why the headline they're trying to read renders as "Ð' Ð"оÑ?дÑfме пÑEURедложили оÑ,обÑEURаÑ,ÑOE "Ð?обелÑ?" Ñf Ðz(бамÑ< " rather than "? ??? ?? "??" ? ?". Well, *there's* a h

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Anne van Kesteren
On Fri, Aug 30, 2013 at 7:33 PM, Joshua Cranmer 🐧 wrote: > The problem I have with this approach is that it assumes that the page is > authored by someone who definitively knows the charset, which is not a > scenario which universally holds. Suppose you have a page that serves up the > contents of

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Joshua Cranmer 🐧
On 8/30/2013 4:01 AM, Anne van Kesteren wrote: On Fri, Aug 30, 2013 at 9:40 AM, Gervase Markham wrote: We don't want people to try and move to UTF-8, but move back because they haven't figured out how (or are technically unable) to label it correctly and "it comes out all wrong". You also don'

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Anne van Kesteren
On Fri, Aug 30, 2013 at 6:31 PM, Chris Peterson wrote: > Is there a less error-prone default we can recommend to Linux distribution > packagers? Maybe we can squelch the problem upstream instead of adding > browser hacks. The number of web server and distro packagers we would need > to reach out t

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Chris Peterson
On 8/30/13 3:03 AM, Henri Sivonen wrote: Telemetry data suggests that these days the more common reason for seeing mojibake is that there is an encoding declaration but it is wrong. My guess is that this arises from Linux distributions silently changing their Apache defaults to send a charset pa

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Mike Hoye
On 2013-08-30 11:17 AM, Adam Roach wrote: It seems to me that there's an important balance here between (a) letting developers discover their configuration error and (b) allowing users to render misconfigured content without specialized knowledge. For what it's worth Internet Explorer handled

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Adam Roach
On 8/30/13 05:08, Nicholas Nethercote wrote: On Fri, Aug 30, 2013 at 8:03 PM, Henri Sivonen wrote: I think we should encourage Web authors to use UTF-8 *and* to *declare* it. I'm no expert on this stuff, but Henri's point sure sounds sensible to me. It seems to me that there's an important

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Henri Sivonen
On Fri, Aug 30, 2013 at 4:31 PM, Aryeh Gregor wrote: > In particular, you need to decide on the encoding before you start > running any user script, because you don't want document.characterSet > etc. to change once it might have already been accessed. For > performance reasons, we want to be abl

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Aryeh Gregor
On Fri, Aug 30, 2013 at 1:03 PM, Henri Sivonen wrote: > This is true if you run the heuristic over the entire byte stream. > Unfortunately, since we support incremental loading of HTML (and will > have to continue to do so), we don't have the entire byte stream > available at the time when we nee
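To make the incremental-loading constraint concrete, here is a minimal sketch of validating only the bytes received so far, using Python's standard incremental decoder. It is an illustration under the assumption that the body arrives as a sequence of chunks; the function name is hypothetical and this is not the parser's actual logic.

```python
import codecs
from typing import Iterable


def prefix_is_valid_utf8(chunks: Iterable[bytes]) -> bool:
    """Check whether the chunks seen so far could be the start of UTF-8.

    final=False lets a multi-byte sequence be split across chunk
    boundaries without being reported as an error.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        for chunk in chunks:
            decoder.decode(chunk, final=False)
    except UnicodeDecodeError:
        return False
    return True


# "café" in UTF-8, with the two-byte "é" split across network chunks.
print(prefix_is_valid_utf8([b"caf\xc3", b"\xa9"]))
# ISO-8859-1 bytes: 0xE9 is followed by a non-continuation byte.
print(prefix_is_valid_utf8([b"caf\xe9 au lait"]))
```

A True result only says that nothing seen so far rules out UTF-8. A prefix that is pure ASCII, or that ends in the middle of a sequence, still passes, so the answer can change after the point where the encoding has to be fixed.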

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Nicholas Nethercote
On Fri, Aug 30, 2013 at 8:03 PM, Henri Sivonen wrote: > > I think we should encourage Web authors to use UTF-8 *and* to *declare* it. I'm no expert on this stuff, but Henri's point sure sounds sensible to me. Nick

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Henri Sivonen
On Thu, Aug 29, 2013 at 9:41 PM, Zack Weinberg wrote: > All the discussion of fallback character encodings has reminded me of an > issue I've been meaning to bring up for some time: As a user of the en-US > localization, nowadays the overwhelmingly most common situation where I see > mojibake is w

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Anne van Kesteren
On Fri, Aug 30, 2013 at 9:40 AM, Gervase Markham wrote: > We don't want people to try and move to UTF-8, but move back because > they haven't figured out how (or are technically unable) to label it > correctly and "it comes out all wrong". You also don't want it to be wrong half of the time. Give

Re: Detection of unlabeled UTF-8

2013-08-30 Thread Gervase Markham
On 29/08/13 19:41, Zack Weinberg wrote: > All the discussion of fallback character encodings has reminded me of an > issue I've been meaning to bring up for some time: As a user of the > en-US localization, nowadays the overwhelmingly most common situation > where I see mojibake is when a site puts

Re: Detection of unlabeled UTF-8

2013-08-29 Thread Anne van Kesteren
On Thu, Aug 29, 2013 at 7:41 PM, Zack Weinberg wrote: > If people are concerned about "infecting" the modern platform with > heuristics, perhaps we could limit application of the heuristic to quirks > mode, for HTML delivered over HTTP. I expect this would cover the majority > of the sites describ

Detection of unlabeled UTF-8

2013-08-29 Thread Zack Weinberg
All the discussion of fallback character encodings has reminded me of an issue I've been meaning to bring up for some time: As a user of the en-US localization, nowadays the overwhelmingly most common situation where I see mojibake is when a site puts UTF-8 in its pages without declaring any en
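A small sketch of how this kind of mojibake arises, for readers who have not seen it reproduced. It assumes ISO-8859-1 as a stand-in for the fallback encoding (the en-US fallback is windows-1252, which differs in a few byte positions); the sample string is the name quoted elsewhere in this thread.

```python
# Text the author saved as UTF-8 but served without any encoding label.
original = "Хизер Фланаган"
wire_bytes = original.encode("utf-8")

# A client that falls back to a single-byte legacy encoding maps every
# byte to its own character, so each two-byte Cyrillic letter becomes
# two unrelated Latin characters.
garbled = wire_bytes.decode("latin-1")
print(len(original), len(garbled))

# No bytes were lost, so re-decoding the same bytes as UTF-8 restores
# the intended text.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == original)
```

Because the bytes themselves are intact, the damage is purely one of interpretation, which is what makes detection after the fact feasible at all.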