On 10/1/15 5:36 PM, Incnis Mrsi wrote:
> The first is “media type (a.k.a. MIME) sniffing”, where the browser
> overrides the media type/subtype. This is implemented in the
> toolkit/components/mediasniffer/nsMediaSniffer.cpp component (and
> possibly others; I don’t know).

Note that these are generally very conservative in their application. There are very few cases in which we will override a server-provided MIME type for a document load, for example (I think the RSS thing might well be the only case, in fact).
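
For concreteness, the sniffer is basically a table of magic bytes plus a policy about where it is allowed to run at all. A rough sketch of the idea in Python (the table below is a handful of well-known container signatures, not a copy of what nsMediaSniffer.cpp actually ships):

    # Simplified magic-byte sniffing; the real table in nsMediaSniffer.cpp
    # is longer, and the "when may this run at all" policy around it is
    # where the conservatism lives.
    MAGIC = [
        (0, b"ID3",           "audio/mpeg"),       # MP3 with an ID3v2 tag
        (0, b"OggS",          "application/ogg"),  # Ogg container
        (0, b"\x1aE\xdf\xa3", "video/webm"),       # EBML header (WebM/Matroska)
        (4, b"ftyp",          "video/mp4"),        # ISO BMFF "ftyp" box
    ]

    def sniff_media_type(first_bytes, supplied_type):
        for offset, magic, sniffed in MAGIC:
            if first_bytes[offset:offset + len(magic)] == magic:
                return sniffed
        return supplied_type  # no match: keep trusting the server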

> There is a proposal (https://bugzilla.mozilla.org/show_bug.cgi?id=471020)
> to make Firefox’s behaviour compatible with MS Internet Explorer and with
> https://mimesniff.spec.whatwg.org/#supplied-mime-type-detection-algorithm,
> using «X-Content-Type-Options: nosniff» to switch the sniffing off.

Right, but that proposal needs to actually define what it means to switch the sniffing off. I don't believe the IE behavior is documented anywhere, unfortunately.
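
The mechanical part is easy; one could imagine something like the sketch below (the dict-of-headers shape is hypothetical, not our actual API). The open question is precisely which of the various sniffers the header is supposed to disable:

    # Hypothetical shape only. What "switch sniffing off" covers --
    # media sniffing, text-vs-binary detection, charset guessing --
    # is exactly the part nobody has written down.
    def effective_type(headers, sniffed_type):
        nosniff = headers.get("X-Content-Type-Options", "").strip().lower()
        if nosniff == "nosniff":
            return headers.get("Content-Type")  # trust the server verbatim
        return sniffed_type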

> The second scenario is a less-known “UTF sniffing”, applicable only to
> text media types. The browser respects the type proper, but overrides
> the «charset=» label with its own guesses.

Just to be clear, which situations are we talking about here?

For HTML, the behavior is defined in https://html.spec.whatwg.org/#determining-the-character-encoding and basically says that a UTF-16 or UTF-8 BOM will override a transport-layer encoding declaration such as the "charset" bit in the Content-Type header.

For CSS, the behavior is defined in http://www.w3.org/TR/css3-syntax/#input-byte-stream, which basically uses https://encoding.spec.whatwg.org/#decode; that algorithm once again looks at the BOM, and only if one is missing considers other sources of encoding information (like HTTP headers).

For text/plain and other types that trigger "show as text" processing in the browser, the relevant spec is https://html.spec.whatwg.org/#read-text which defers to the specifications for the relevant MIME type. So arguably for text/css we should consider the BOM before other things, while for text/plain we should do what RFC 2046 defines for text/plain. Unfortunately, what that RFC defines is to use the "us-ascii" encoding if there is no charset parameter supplied, which is not really a useful thing to do on the web today. So I suspect that what we do in practice is exactly the same thing as for HTML.
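
To make the shared core of those three cases concrete, here is a rough Python sketch; the real algorithms have more steps (user overrides and the <meta> prescan for HTML, @charset for CSS, locale-dependent defaults everywhere), so treat this as the skeleton only:

    # Common core: a BOM, if present, beats the transport-layer charset.
    def pick_encoding(first_bytes, transport_charset=None):
        if first_bytes.startswith(b"\xef\xbb\xbf"):
            return "utf-8"      # UTF-8 BOM
        if first_bytes.startswith(b"\xff\xfe"):
            return "utf-16le"   # UTF-16 little-endian BOM
        if first_bytes.startswith(b"\xfe\xff"):
            return "utf-16be"   # UTF-16 big-endian BOM
        if transport_charset:
            return transport_charset  # charset= on Content-Type
        return "windows-1252"   # stand-in for the locale-dependent default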

> This is implemented in netwerk/base/nsUnicharStreamLoader.cpp

There is no sniffing that I see there. It just hands the initial bytes and network headers to its consumers and asks them to pick an encoding.

In practice the only consumer is CSS, which is already discussed above.

> HTML5 encoding sniffing, which isn’t (reasonably) applicable to
> text/plain.

Why not, if I might ask?

> In the case of text/plain it leads to bugs. Simple test cases are
> available at http://course.irccity.ru/ya-yu-9-amp.txt (toxic UTF-16
> “BOM”)

Why is this particularly a problem for text/plain but not text/html? If your BOM doesn't match your text, you will have a bad time...

Notably, opening this file in a text editor will show U+2639, because a BOM is what a text editor has to go on.
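
To spell out the arithmetic (reconstructing the file contents from the file names, which is an assumption on my part; I haven't re-fetched them):

    # "яю9&" encoded as windows-1251 begins with FF FE, which is exactly
    # a UTF-16 little-endian BOM, and FF FE 39 26 as UTF-16LE is the
    # single character U+2639 (the frowning face the editor shows).
    data = "яю9&".encode("windows-1251")
    assert data == b"\xff\xfe\x39\x26"
    print(data.decode("utf-16"))  # -> '\u2639'; the BOM is consumed

    # The other test file presumably plays the same trick with "п»ї",
    # whose windows-1251 bytes EF BB BF are precisely the UTF-8 BOM.
    assert "п»ї".encode("windows-1251") == b"\xef\xbb\xbf"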

> and http://course.irccity.ru/p-guillemet-yi-ya.txt (toxic UTF-8
> “BOM”). It poses less immediate security risk

Indeed.

> but still can cause data corruption whenever arbitrary data are allowed
> into (the beginning of) text/plain documents.

True. On the other hand, if you're allowing arbitrary injection of untrusted content into your document you also have to worry about injection of U+202E (RIGHT-TO-LEFT OVERRIDE) and other fun things, no?
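
(For anyone who hasn't seen that trick: U+202E makes a bidi-aware display render the characters that follow it right-to-left, so injected plain text can visually lie about what comes after it.)

    # Classic example: the trailing "fdp.exe" is displayed reversed, so
    # this string renders roughly as "invoiceexe.pdf" in a bidi-aware UI.
    print("invoice\u202Efdp.exe")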

> The toxic UTF sniffing was observed in Firefox, MSIE, Google Chrome,
> and Safari

Right, because I assume they all basically did the same thing: observed that the spec for how encodings should be handled for text/plain is old and direct application of it is daft on the web today, and since they're sending the data through the HTML parser _anyway_ just reused the HTML parser's encoding codepath....

> Possible approaches to the toxic UTF sniffing include:
> • Just fix it (certainly would cause backlash from people eager to burn
> anything except UTF-8).

Well, and is also likely to break some documents that are out there with bogus charset parameters right now.

I assume by "just fix it" you mean "reverse the precedence of BOM and the charset parameter of Content-Type", not "follow RFC 2046 to the letter"? But if you mean the latter, please do say so, because then we can just stop this conversation right now.

> • Make a new Firefox preference (e.g. network.http.charset_quirk_level)
> controlling the browser’s behaviour.

What's the point?  When would this ever be useful?

> • Make patches for the source code, to be used only by those who are
> interested.

And of course:

• Say it's not a problem in any reasonable scenario

right?

-Boris