[websec] When is sniffing heuristic?

Larry Masinter Sun, 08 Jan 2012 16:06:17 -0800

There are several situations where sniffing is of necessity heuristic, because 
you are 'guessing' the intent of the content.
These arise because the set of possible valid Content-Type values does not 
partition the space of possible bodies: a single body can be valid as more 
than one media type.


There may be other situations where sniffing is heuristic, but in these cases 
sniffing is *necessarily* heuristic because there are multiple results which 
are valid, and knowing the right result requires additional information about 
the intent of the communication. The heuristic presumably comes from a manual 
examination of some web material where such information about intent is known, 
and from projecting that generalization onto all such material and cases.

a) Specializations:

A file which is, for example, application/xhtml+xml is, of necessity, also a 
valid file of type application/xml. If you were to "sniff" some content that 
was valid application/xhtml+xml, you could also legitimately claim it was 
application/xml.
Most data types which are 'text' are also text/plain.
Every type is a subset of application/octet-stream.
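
The nesting described above can be sketched in Python. This is only an 
illustration and not a real sniffing algorithm: the function name 
candidate_types and the specific checks (UTF-8 decoding as a stand-in for "is 
text", a root element in the XHTML namespace as a stand-in for "is XHTML") are 
assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from typing import List

XHTML_NS = "http://www.w3.org/1999/xhtml"


def candidate_types(body: bytes) -> List[str]:
    """Return every media type (of the four considered) the body is valid as.

    Hypothetical helper: the checks are deliberately crude stand-ins.
    """
    # Any sequence of bytes at all is valid application/octet-stream.
    candidates = ["application/octet-stream"]

    # Assumption: "decodes as UTF-8" stands in for "is text".
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return candidates
    candidates.append("text/plain")

    # Well-formed XML is valid application/xml.
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return candidates
    candidates.append("application/xml")

    # Assumption: an <html> root in the XHTML namespace stands in for
    # "valid application/xhtml+xml".
    if root.tag == "{%s}html" % XHTML_NS:
        candidates.append("application/xhtml+xml")
    return candidates


if __name__ == "__main__":
    xhtml = (b'<html xmlns="http://www.w3.org/1999/xhtml">'
             b"<head><title>t</title></head><body/></html>")
    print(candidate_types(xhtml))
```

Every entry in the returned list is a legitimate answer for the XHTML body, so 
choosing one of them is a guess about intent, not a fact about the bytes.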

There are numerous examples of this, and a large number of failure cases, 
e.g., zip-based packaging formats being sniffed as zip when the specialization 
isn't correctly recognized, image/dng which is sniffed to be image/tiff, etc.
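
The zip failure case can be reproduced with a small Python sketch. It assumes 
a packaging convention in which the first archive member is named "mimetype" 
and holds the package's media type (EPUB's container format works this way); 
the helper names sniff_magic, sniff_container and make_package are 
hypothetical, and a real sniffer would not trust the member's content blindly.

```python
import io
import zipfile

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature of a zip archive


def sniff_magic(body: bytes) -> str:
    """Look only at the leading bytes, as a simple sniffer might."""
    if body.startswith(ZIP_MAGIC):
        return "application/zip"
    return "application/octet-stream"


def sniff_container(body: bytes) -> str:
    """Also look inside the archive for a leading 'mimetype' member."""
    if not body.startswith(ZIP_MAGIC):
        return "application/octet-stream"
    try:
        with zipfile.ZipFile(io.BytesIO(body)) as archive:
            names = archive.namelist()
            if names and names[0] == "mimetype":
                declared = archive.read("mimetype")
                return declared.decode("ascii", "replace").strip()
    except zipfile.BadZipFile:
        pass
    # No recognizable specialization: fall back to the general type.
    return "application/zip"


def make_package(declared_type: str) -> bytes:
    """Build a tiny in-memory package for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("mimetype", declared_type)
        archive.writestr("content.txt", "hello")
    return buffer.getvalue()


if __name__ == "__main__":
    package = make_package("application/epub+zip")
    print(sniff_magic(package))
    print(sniff_container(package))
```

Both answers are valid for the same bytes; the magic-number check simply stops 
at the more general type.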


b) Polyglot:

This is a situation where data is intentionally prepared to be interpretable as 
two different media types, possibly to be served and later processed as either, 
where the intention of the content is to behave similarly for ordinary 
processing, but amenable to specialized processing only defined for one or the 
other media type. The XHTML/HTML polyglot spec

http://dev.w3.org/html5/html-xhtml-author-guide/

is of course the most relevant use case. The same content could be sniffed 
to be either type. This is different from the specialization case because 
neither media type is a subset of the other.
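
As a rough Python illustration of the polyglot case, the sketch below feeds 
one document to an XML parser and to an HTML tokenizer and compares the 
element names each one sees. The sample document and the helper names are 
made up for the example, and Python's html.parser is only a convenient 
stand-in: it is not a conforming HTML5 parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import List

# Hypothetical sample written to be acceptable to both parsers.
POLYGLOT = (
    "<!DOCTYPE html>"
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Hi</title></head>"
    "<body><p>Hello</p></body>"
    "</html>"
)


class TagCollector(HTMLParser):
    """Record the name of every start tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_as_html(document: str) -> List[str]:
    collector = TagCollector()
    collector.feed(document)
    collector.close()
    return collector.tags


def tags_as_xml(document: str) -> List[str]:
    root = ET.fromstring(document)
    # Drop the "{namespace}" prefix ElementTree puts on qualified names.
    return [element.tag.rsplit("}", 1)[-1] for element in root.iter()]


if __name__ == "__main__":
    print(tags_as_html(POLYGLOT))
    print(tags_as_xml(POLYGLOT))
```

Both readings yield the same element structure, so nothing in the content 
itself tells a sniffer which of the two media types was intended.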

c) "Multiview":

I don't know exactly what to call this, but it is the situation where the same 
content is intentionally valid as two different media types, neither media type 
is a subset of the other, and the treatment as the different types is 
intentionally different. The use case for multiview I was looking at was one 
where the same content could be viewed as XHTML (for a presentational view) and 
also as RDF (for a data point of view).

This is different from specialization (since the two types overlap but one is 
not a subset of the other), and polyglot (since the material is intended to 
have different meaning in its ordinary application).


_______________________________________________
websec mailing list
websec@ietf.org
https://www.ietf.org/mailman/listinfo/websec
