On 10/1/15 5:36 PM, Incnis Mrsi wrote:
> The first is “media type (a.k.a. MIME) sniffing”, where the browser
> overrides the media type/subtype. This is implemented in the
> toolkit/components/mediasniffer/nsMediaSniffer.cpp component (and
> possibly others; I don’t know).

Note that these are generally very conservative in their application. There are very few cases in which we will override a server-provided MIME type for a document load, for example (I think the RSS thing might well be the only case, in fact).
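
For concreteness, the sniffer is basically a table of magic bytes plus a policy about where it is allowed to run at all. A rough sketch of the idea in Python (the table below is a handful of well-known container signatures, not a copy of what nsMediaSniffer.cpp actually ships):

    # Simplified magic-byte sniffing; the real table in nsMediaSniffer.cpp
    # is longer, and the "when may this run at all" policy around it is
    # where the conservatism lives.
    MAGIC = [
        (0, b"ID3",           "audio/mpeg"),       # MP3 with an ID3v2 tag
        (0, b"OggS",          "application/ogg"),  # Ogg container
        (0, b"\x1aE\xdf\xa3", "video/webm"),       # EBML header (WebM/Matroska)
        (4, b"ftyp",          "video/mp4"),        # ISO BMFF "ftyp" box
    ]

    def sniff_media_type(first_bytes, supplied_type):
        for offset, magic, sniffed in MAGIC:
            if first_bytes[offset:offset + len(magic)] == magic:
                return sniffed
        return supplied_type  # no match: keep trusting the server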

> There is a proposal (https://bugzilla.mozilla.org/show_bug.cgi?id=471020)
> to make Firefox’s behaviour compatible with MS Internet Explorer and with
> https://mimesniff.spec.whatwg.org/#supplied-mime-type-detection-algorithm,
> using «X-Content-Type-Options: nosniff» to switch the sniffing off.

Right, but that proposal needs to actually define what it means to switch the sniffing off. I don't believe the IE behavior is documented anywhere, unfortunately.
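
The mechanical part is easy; one could imagine something like the sketch below (the dict-of-headers shape is hypothetical, not our actual API). The open question is precisely which of the various sniffers the header is supposed to disable:

    # Hypothetical shape only. What "switch sniffing off" covers --
    # media sniffing, text-vs-binary detection, charset guessing --
    # is exactly the part nobody has written down.
    def effective_type(headers, sniffed_type):
        nosniff = headers.get("X-Content-Type-Options", "").strip().lower()
        if nosniff == "nosniff":
            return headers.get("Content-Type")  # trust the server verbatim
        return sniffed_type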

> The second scenario is a less-known “UTF sniffing”, applicable only to
> text media types. The browser respects the type proper, but overrides
> the «charset=» label with its own guesses.

Just to be clear, which situations are we talking about here?

For HTML, the behavior is defined in https://html.spec.whatwg.org/#determining-the-character-encoding and basically says that a UTF-16 or UTF-8 BOM will override a transport-layer encoding declaration such as the "charset" bit in the Content-Type header.

For CSS, the behavior is defined in http://www.w3.org/TR/css3-syntax/#input-byte-stream, which basically uses https://encoding.spec.whatwg.org/#decode; that algorithm once again looks at the BOM, and only if one is missing considers other sources of encoding information (like HTTP headers).

For text/plain and other types that trigger "show as text" processing in the browser, the relevant spec is https://html.spec.whatwg.org/#read-text which defers to the specifications for the relevant MIME type. So arguably for text/css we should consider the BOM before other things, while for text/plain we should do what RFC 2046 defines for text/plain. Unfortunately, what that RFC defines is to use the "us-ascii" encoding if there is no charset parameter supplied, which is not really a useful thing to do on the web today. So I suspect that what we do in practice is exactly the same thing as for HTML.
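
To make the shared core of those three cases concrete, here is a rough Python sketch; the real algorithms have more steps (user overrides and the <meta> prescan for HTML, @charset for CSS, locale-dependent defaults everywhere), so treat this as the skeleton only:

    # Common core: a BOM, if present, beats the transport-layer charset.
    def pick_encoding(first_bytes, transport_charset=None):
        if first_bytes.startswith(b"\xef\xbb\xbf"):
            return "utf-8"      # UTF-8 BOM
        if first_bytes.startswith(b"\xff\xfe"):
            return "utf-16le"   # UTF-16 little-endian BOM
        if first_bytes.startswith(b"\xfe\xff"):
            return "utf-16be"   # UTF-16 big-endian BOM
        if transport_charset:
            return transport_charset  # charset= on Content-Type
        return "windows-1252"   # stand-in for the locale-dependent default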

> This is implemented in netwerk/base/nsUnicharStreamLoader.cpp

There is no sniffing that I see there. It just hands the initial bytes and network headers to its consumers and asks them to pick an encoding.

In practice the only consumer is CSS, which is already discussed above.

> HTML5 encoding sniffing, which isn’t (reasonably) applicable to
> text/plain.

Why not, if I might ask?

> In the case of text/plain it leads to bugs. Simple test cases are
> available at http://course.irccity.ru/ya-yu-9-amp.txt (toxic UTF-16
> “BOM”)

Why is this particularly a problem for text/plain but not text/html? If your BOM doesn't match your text, you will have a bad time...

Notably, opening this file in a text editor will show U+2639, because a BOM is what a text editor has to go on.
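
To spell out the arithmetic (reconstructing the file contents from the file names, which is an assumption on my part; I haven't re-fetched them):

    # "яю9&" encoded as windows-1251 begins with FF FE, which is exactly
    # a UTF-16 little-endian BOM, and FF FE 39 26 as UTF-16LE is the
    # single character U+2639 (the frowning face the editor shows).
    data = "яю9&".encode("windows-1251")
    assert data == b"\xff\xfe\x39\x26"
    print(data.decode("utf-16"))  # -> '\u2639'; the BOM is consumed

    # The other test file presumably plays the same trick with "п»ї",
    # whose windows-1251 bytes EF BB BF are precisely the UTF-8 BOM.
    assert "п»ї".encode("windows-1251") == b"\xef\xbb\xbf"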

> and http://course.irccity.ru/p-guillemet-yi-ya.txt (toxic UTF-8
> “BOM”). It poses less immediate security risk

Indeed.

> but still can cause data corruption whenever arbitrary data are allowed
> into (the beginning of) text/plain documents.

True. On the other hand, if you're allowing arbitrary injection of untrusted content into your document you also have to worry about injection of U+202E (RIGHT-TO-LEFT OVERRIDE) and other fun things, no?
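
(For anyone who hasn't seen that trick: U+202E makes a bidi-aware display render the characters that follow it right-to-left, so injected plain text can visually lie about what comes after it.)

    # Classic example: the trailing "fdp.exe" is displayed reversed, so
    # this string renders roughly as "invoiceexe.pdf" in a bidi-aware UI.
    print("invoice\u202Efdp.exe")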

> The toxic UTF sniffing was observed in Firefox, MSIE, Google Chrome,
> and Safari

Right, because I assume they all basically did the same thing: observed that the spec for how encodings should be handled for text/plain is old and direct application of it is daft on the web today, and since they're sending the data through the HTML parser _anyway_ just reused the HTML parser's encoding codepath....

> Possible approaches to the toxic UTF sniffing include:
> • Just fix it (certainly would cause backlash from people eager to burn
> anything except UTF-8).

Well, and is also likely to break some documents that are out there with bogus charset parameters right now.

I assume by "just fix it" you mean "reverse the precedence of BOM and the charset parameter of Content-Type", not "follow RFC 2046 to the letter"? But if you mean the latter, please do say so, because then we can just stop this conversation right now.

> • Make a new Firefox preference (e.g. network.http.charset_quirk_level)
> controlling the browser’s behaviour.

What's the point?  When would this ever be useful?

> • Make patches for the source code, to be used only by those who are
> interested.

And of course:

• Say it's not a problem in any reasonable scenario

right?

-Boris