Re: [whatwg] Content type sniffing
I should say that these figures are weighted by the number of page loads, so if sniffing for a particular tag is needed for the digg.com home page, it will show up as a large number. If you don't weight by traffic, you get similar results, but with slightly different numbers.

Adam

On Sun, Jan 11, 2009 at 11:54 PM, Adam Barth wha...@adambarth.com wrote:

On Sun, Jan 11, 2009 at 6:41 PM, Boris Zbarsky bzbar...@mit.edu wrote:

I just noticed that section 2.7.1 of HTML5 says: Extensions must not be used for determining resource types for resources fetched over HTTP.

Extensions are bad news for content sniffing because they can often be chosen by the attacker. For example, suppose user-uploaded content can be downloaded at:

http://example.com/download.php

In most PHP configurations, the attacker can choose whatever file extension he likes by directing the user's browser to:

http://example.com/download.php/whatever.foo

And the PHP script will happily run.

Now this use case (no content-type at all) was pretty common when the unknown type sniffer in Gecko was written, but that was years ago. Do we have any data on how common it is now?

Yes. We do have lots of data from opt-in user metrics from Chrome. Here is a somewhat recent summary:

https://crypto.stanford.edu/~abarth/research/html5/content-sniffing/

To address your particular concern, body occurs 6899 times less often than script on Web content that lacks a Content-Type (or has a bogus Content-Type like */*), assuming I did my arithmetic correctly.

P.S. Of course at the moment the sniffer in Gecko is used for more than just HTTP, and it looks like we'll need separate modes for things like HTTP and things like file://. I can live with that, though. For the file:// case detection of HTML in documents with no doctype/html/head is a must.

I'm sympathetic to adding more HTML tags to the list, but I'm not sure how far down the tail we should go. In Chrome, we went for 99.999% compatibility, which might be a bit far down the tail. You can see the algorithm here:

http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?view=markup

Using that figure, we went down to p (which is two tags less common than body).

Adam
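
To make the point about attacker-chosen extensions concrete, here is a minimal Python sketch. It only assumes that an extension-based sniffer would read the extension from the last segment of the URL path; the helper name url_extension is made up for this illustration and does not come from any browser.

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the extension of the last path segment (hypothetical helper)."""
    # urlsplit isolates the path; splitext looks only at the final segment.
    return posixpath.splitext(urlsplit(url).path)[1]


# The same PHP script serves both URLs, but the apparent extension differs
# because the trailing path segment is supplied by whoever builds the link.
honest = url_extension("http://example.com/download.php")
forged = url_extension("http://example.com/download.php/whatever.foo")
```

Any heuristic keyed on that value would therefore be steered by the person who wrote the link rather than by the server.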
Re: [whatwg] Fuzzbot (Firefox RDFa semantics processor)
Martin Atkins wrote: ... If it is true that RDFa can work today with no ill-effect in downlevel user-agents, what's currently blocking its implementation? Concern for validation? It seems to me that many HTML extensions are implemented first and specified later[1], so perhaps it would be in the interests of RDFa proponents to get some implementations out there and get RDFa adopted, at which point it will hopefully seem a much more useful proposition for inclusion in HTML5. In the short term the RDFa community can presumably provide a specialized HTML5 + RDFa validator for adopters to use until RDFa is incorporated into the core spec and tools. It would seem that it's much easier to get into the spec when your feature is proven to be useful by real-world adoption. ... What he said. Although I *do* believe that in the end we'll want RDFa-in-HTML5, what's really important right now is *not* RDFa-in-HTML5 but RDFa-in-HTML4. Define that, make it a success, and the rest will be simple. Best regards, Julian
[whatwg] data-* [Was:Re: Trying to work out the problems solved by RDFa]
Benjamin Hawkes-Lewis wrote:

On 11/1/09 16:52, Calogero Alex Baldacchino wrote:

Well, that's a chance, of course, but that's *not* RDFa as specified by W3C; for instance, @property is specified as accepting _only_ CURIEs

Good point; I hadn't spotted that.

It's the same with every possible existing custom (non-standard) attribute and element out there, since there is no standard for them, and instead data-* has been created;

Emphatically, data-* has been created for private use data encoding (basically for scripting purposes) not as a replacement for the existing practices of adding new elements and attributes to HTML without going through W3C/WHATWG.

It should perhaps set alarm bells ringing that almost every time data-* attributes come up, people suggest using them to publish data to the web at large rather than as internal scripting hooks. Since the restrictions on data-* are not machine checkable, even the majority of standards-aware authors are unlikely to heed them. Therefore the net effect of the restriction will be to prevent conscientious standards bodies from using data-* attributes in their specifications. It is quite possible that popular technologies will arise from sources other than such standards organisations and so use of data-* for more than just private scripting may be inevitable. It is also possible that features that start off as private scripting hooks will evolve into data publishing features. This again would lead to the natural breaking of the restriction of data-* attributes.

(I know I have said this before but I forget whether I posted it or just discussed it on IRC.)
Re: [whatwg] data-* [Was:Re: Trying to work out the problems solved by RDFa]
James Graham wrote:

It should perhaps set alarm bells ringing that almost every time data-* attributes come up, people suggest using them to publish data to the web at large rather than as internal scripting hooks. Since the restrictions on data-* are not machine checkable, even the majority of standards-aware authors are unlikely to heed them. Therefore the net effect of the restriction will be to prevent conscientious standards bodies from using data-* attributes in their specifications. It is quite possible that popular technologies will arise from sources other than such standards organisations and so use of data-* for more than just private scripting may be inevitable. It is also possible that features that start off as private scripting hooks will evolve into data publishing features. This again would lead to the natural breaking of the restriction of data-* attributes.

(I know I have said this before but I forget whether I posted it or just discussed it on IRC.)

Agreed. So what does this tell us about the point of view that distributed extensibility should not be supported by HTML5?

Best regards, Julian
Re: [whatwg] getElementsByClassName case sensitivity
Ian Hickson i...@hixie.ch wrote (on 25 July 2008):

I've made [getElementsByClassName] consistent with how classes work in CSS (case-insensitive for quirks and case-sensitive otherwise).

I was looking for some tests for this API and found some from Opera (at http://tc.labs.opera.com/apis/getElementsByClassName/), but since their dates predate the latest spec changes (which cause some of them to fail now), I was wondering whether up-to-date versions are now kept somewhere else.

-- Stewart Brodie Software Engineer ANT Software Limited
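
The rule quoted above (case-insensitive in quirks mode, case-sensitive otherwise) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are invented, the case folding is assumed to be ASCII-only, and class attributes are assumed to be split on space, tab, LF, FF and CR. It is not taken from any browser or from the Opera tests.

```python
import re
from typing import List

# Assumption: only A-Z are folded when matching in quirks mode.
_ASCII_LOWER = str.maketrans("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
                             "abcdefghijklmnopqrstuvwxyz")


def class_tokens(class_attr: str) -> List[str]:
    """Split a class attribute value into its non-empty tokens."""
    return [t for t in re.split(r"[ \t\n\f\r]+", class_attr) if t]


def matches_class(class_attr: str, wanted: str, quirks_mode: bool) -> bool:
    """Report whether an element with this class attribute matches `wanted`."""
    tokens = class_tokens(class_attr)
    if quirks_mode:
        # Quirks mode: compare after folding both sides to lower case.
        wanted = wanted.translate(_ASCII_LOWER)
        tokens = [t.translate(_ASCII_LOWER) for t in tokens]
    return wanted in tokens
```

A test written before the change would expect matches_class("Foo bar", "foo", False) to be true, which is the kind of failure mentioned above.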
Re: [whatwg] Fuzzbot (Firefox RDFa semantics processor)
Martin Atkins wrote: * Some sites are already publishing XFN and/or hCard so consuming software would need to continue to support these in addition to FOAF-in-HTML-somehow, which is more work than supporting only XFN and hCard. Mitigating this though is GRDDL which allows the hCard+XFN to be parsed using a subset of FOAF (e.g. http://weborganics.co.uk/hFoaF/) and thus merged with FOAF available as RDF/XML, RDFa, etc. -- Toby A Inkster mailto:m...@tobyinkster.co.uk http://tobyinkster.co.uk
Re: [whatwg] Content type sniffing
Adam Barth wrote:

Extensions are bad news for content sniffing because they can often be chosen by the attacker. For example, suppose user-uploaded content can be downloaded at: http://example.com/download.php In most PHP configurations, the attacker can choose whatever file extension he likes by directing the user's browser to: http://example.com/download.php/whatever.foo And the PHP script will happily run.

Right, I understand that.

Yes. We do have lots of data from opt-in user metrics from Chrome. Here is a somewhat recent summary: https://crypto.stanford.edu/~abarth/research/html5/content-sniffing/

I'm not quite sure what to make of this, actually. Specifically, where is the 22.19% number for HTML Tags coming from? 22.19% of what? The magic numbers stuff actually adds up to 100%, but of what?

To address your particular concern, body occurs 6899 times less often than script on Web content that lacks a Content-Type (or has a bogus Content-Type like */*), assuming I did my arithmetic correctly.

OK, that's good to know.

I'm sympathetic to adding more HTML tags to the list, but I'm not sure how far down the tail we should go. In Chrome, we went for 99.999% compatibility, which might be a bit far down the tail.

Doesn't seem that way to me, given the number of web pages out there.

http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?view=markup

Ah, ok. The relevant Gecko code is http://hg.mozilla.org/mozilla-central/annotate/9f82199fdb9c/netwerk/streamconv/converters/nsUnknownDecoder.cpp#l477. I'd probably be fine with trimming that list down a bit, but I'm not quite sure what the downsides of having more tags in it are here.

-Boris
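
For readers unfamiliar with what a "tag list" in a sniffer looks like, the following Python sketch shows the general shape of the heuristic being discussed. Everything in it is an assumption made for illustration: the tag list is invented and is NOT the list in Chrome's mime_sniffer.cc or Gecko's nsUnknownDecoder.cpp, and the 512-byte window and the function name are likewise hypothetical.

```python
from typing import Optional

# Hypothetical tag list for illustration; NOT the list used by Chrome or Gecko.
SNIFFED_TAGS = (b"<!doctype html", b"<html", b"<head", b"<script", b"<body", b"<p")

# Bytes skipped before the first tag, and bytes allowed right after a tag name.
WHITESPACE = b" \t\r\n\x0c"
TAG_END = WHITESPACE + b">"


def sniff_html(payload: bytes, window: int = 512) -> Optional[str]:
    """Return "text/html" if the payload starts with a listed tag, else None."""
    head = payload[:window].lstrip(WHITESPACE).lower()
    for tag in SNIFFED_TAGS:
        if head.startswith(tag):
            # Require a terminating byte so "<pre>" does not match "<p".
            following = head[len(tag):len(tag) + 1]
            if following and following in TAG_END:
                return "text/html"
    return None
```

Each entry added to SNIFFED_TAGS is one more pattern that anyone filtering uploaded content would have to anticipate.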
Re: [whatwg] Content type sniffing
On Mon, Jan 12, 2009 at 7:54 AM, Boris Zbarsky bzbar...@mit.edu wrote:

I'm not quite sure what to make of this, actually. Specifically, where is the 22.19% number for HTML Tags coming from? 22.19% of what? The magic numbers stuff actually adds up to 100%, but of what?

Sorry, the % was confusing. I've removed them. These tables are the relative frequency with which those rules fire in the content sniffer. Probably should have scaled them all to be out of 100 or out of 1, but it was more convenient to scale them out of the totals that I did.

I'm sympathetic to adding more HTML tags to the list, but I'm not sure how far down the tail we should go. In Chrome, we went for 99.999% compatibility, which might be a bit far down the tail.

Doesn't seem that way to me, given the number of web pages out there.

I don't think it makes sense to compare that percentage to the number of web pages. Instead, imagine a user who views 100 pages a day. That user will, in a crude average sense, come across a broken web page once every three years.

http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?view=markup

Ah, ok. The relevant Gecko code is http://hg.mozilla.org/mozilla-central/annotate/9f82199fdb9c/netwerk/streamconv/converters/nsUnknownDecoder.cpp#l477.

Yes, I've examined that code in detail. :) Here is a web page that will let you compare the sniffing algorithms used by four popular browsers: http://webblaze.cs.berkeley.edu/2009/content-sniffing/

I'd probably be fine with trimming that list down a bit, but I'm not quite sure what the downsides of having more tags in it are here.

Most of the cost is complexity (which leads to security vulnerabilities). People who let users upload content and who build firewalls that filter content at the application layer (for example, to look for malware) need to understand browser content sniffing algorithms in order to build secure products. There is a huge complexity win for standardizing the algorithm across multiple implementations, and there is a small complexity loss for each sniffing heuristic we add.

One plan for going forward is to resolve https://bugzilla.mozilla.org/show_bug.cgi?id=465007 and then open another bug for harmonizing the HTML heuristic (with the expectation that harmonization will probably involve changing both the spec and the implementation).

Adam
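
The "once every three years" estimate in this message can be checked with a short calculation. The sketch below assumes that 99.999% compatibility means 10 broken pages per million page views and that a year has 365.25 days; the function name is invented for this example.

```python
def days_between_breakages(pages_per_day: float, broken_per_million: float) -> float:
    """Average number of days between broken pages for one user."""
    broken_per_day = pages_per_day * broken_per_million / 1_000_000
    return 1 / broken_per_day


# 99.999% compatibility leaves 0.001% broken, i.e. 10 pages per million.
days = days_between_breakages(pages_per_day=100, broken_per_million=10)
years = days / 365.25
```

With 100 pages a day this comes to about 1000 days, a little under three years, consistent with the crude average given above.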
Re: [whatwg] Fuzzbot (Firefox RDFa semantics processor)
On Jan 11, 2009, at 14:01, Toby A Inkster wrote:

RDFa *does not* rely on XML namespaces. RDFa relies on eight attributes: about, rel, rev, property, datatype, content, resource and typeof. It also relies on a CURIE prefix binding mechanism. In XHTML and SVG, RDFa happens to use XML namespaces as this mechanism, because they already existed and they were convenient.

Convenience is debatable. In any case, it is rather disingenuous to say that RDFa doesn't rely on XML Namespaces when all that has been defined so far relies on attributes whose qname contains the substring xmlns.

In non-XML markup languages, the route to define CURIE prefixes is still to be decided, though discussions tend to be leaning towards something like:

<html prefix="dc=http://purl.org/dc/terms/ foaf=http://xmlns.com/foaf/0.1/">
<address rel="foaf:maker" rev="foaf:made">This document was made by <a href="http://joe.example.com" typeof="foaf:Person" rel="foaf:homepage" property="foaf:name">Joe Bloggs</a>.</address>
</html>

Unless this syntax were also used for XHTML, the above would be in violation of the DOM Consistency Design Principle of the W3C HTML WG.

This discussion seems to be about "should/can RDFa work in HTML5?" when in fact, RDFa already can and does work in HTML5 - there are approaching a dozen interoperable implementations of RDFa, the majority of which seem to handle non-XHTML HTML.

Those implementations violate the software implementation reuse principle that motivates the DOM Consistency Design Principle. (The software reuse principle being that the same code path be used for both HTML and XHTML on layers higher than the parser.) The prefix mapping mechanism of CURIEs has been designed with disregard towards this software reuse principle (in use in Gecko, WebKit and, I gather, Presto) that should have been known to anyone working on Web-related specs far before DOM Consistency was written into the Design Principles of the HTML WG.

-- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
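
To show what a CURIE prefix binding mechanism has to do regardless of where the bindings come from, here is a small Python sketch. It assumes the prefix=... attribute syntax floated in the quoted message (space-separated name=IRI pairs), which at the time of this thread was only under discussion and not specified anywhere; both function names are invented for the illustration.

```python
from typing import Dict


def parse_prefix_attr(value: str) -> Dict[str, str]:
    """Parse 'name=IRI name=IRI ...' into a prefix map (hypothetical syntax)."""
    mappings: Dict[str, str] = {}
    for item in value.split():
        prefix, sep, iri = item.partition("=")
        if sep and prefix and iri:
            mappings[prefix] = iri
    return mappings


def expand_curie(curie: str, mappings: Dict[str, str]) -> str:
    """Expand 'prefix:reference' by concatenating the bound IRI and reference."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in mappings:
        raise KeyError(f"no binding for CURIE {curie!r}")
    return mappings[prefix] + reference


bindings = parse_prefix_attr(
    "dc=http://purl.org/dc/terms/ foaf=http://xmlns.com/foaf/0.1/"
)
maker = expand_curie("foaf:maker", bindings)
```

The expansion step is identical whether the map is filled from xmlns:* declarations or from an attribute like the one above; the disagreement in this thread is about the source of the map, not the expansion.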
Re: [whatwg] Trying to work out the problems solved by RDFa
On Jan 11, 2009, at 18:52, Calogero Alex Baldacchino wrote: However, actually it's the same for RDFa attributes, because they're not in the spec. From this point of view, introducing six new attributes, or resorting to an older one is not very different, thus (again) why RDFa and not eRDF? eRDF is very different in not relying on attributes whose qname contains the substring xmlns. -- Henri Sivonen hsivo...@iki.fi http://hsivonen.iki.fi/
[whatwg] <code> in "in body" insertion mode (8.2.5)
<code> is listed in the "formatting" category of elements, but isn't dealt with in the same way as other formatting elements when in the "in body" insertion mode. Currently it will fall through to the "any other start tag" case, so the note in that case that says "This element will be a phrasing element" is incorrect. I'm assuming that the <code> element should be listed along with the other formatting elements (b, big, em, etc.) for the "in body" insertion mode. Is that correct? kats
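
The fall-through described here can be pictured with a toy dispatch function in Python. This is a sketch only: the tag sets contain just the names mentioned in the message (not the full list from the spec), the returned strings are informal labels for the two cases, and the function name is invented.

```python
from typing import FrozenSet

# Only the names given in the message ("b, big, em, etc."); not the spec's full list.
FORMATTING_TAGS_AS_DESCRIBED: FrozenSet[str] = frozenset({"b", "big", "em"})
# The change the message proposes: treat "code" like the other formatting tags.
FORMATTING_TAGS_WITH_CODE: FrozenSet[str] = FORMATTING_TAGS_AS_DESCRIBED | {"code"}


def in_body_start_tag_case(tag_name: str, formatting_tags: FrozenSet[str]) -> str:
    """Name the case a start tag would hit in this toy 'in body' dispatcher."""
    if tag_name in formatting_tags:
        return "formatting element"
    # Anything not listed above lands in the catch-all case.
    return "any other start tag"


before = in_body_start_tag_case("code", FORMATTING_TAGS_AS_DESCRIBED)
after = in_body_start_tag_case("code", FORMATTING_TAGS_WITH_CODE)
```

Here before is the catch-all case and after is the formatting case, which is the difference the question is asking about.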
Re: [whatwg] Trying to work out the problems solved by RDFa
Benjamin Hawkes-Lewis wrote:

After all, support for unknown attributes/elements has never been a standard de jure, but more of a quirk

Depends what you mean by support I guess.

I just mean that, as far as I know, there is no official standard requiring UAs to support (parse and expose through the DOM) attributes and elements which are not part of the HTML language but are found in text/html documents. Usually, browsers support them for robustness' sake and/or backward compatibility with existing pages, but they might do it with significant differences (actually it happens for unknown elements but not for unknown attributes, but one shouldn't assume such common behavior might not change in the future, or that it will be adopted by newer vendors (even if that might be a quite safe assumption), thus any hack to the language /for custom purposes and script elaboration/ should be done by means of existing attributes/elements instead of creating new ones (I mean, data-rdfa-about might be a bit safer than just "about", to use a conservative approach based on the assumption "I know what happens today, not what will happen tomorrow") -- before data-* it was possible through the class attribute, now also data-* can be used for custom hacks)

I really don't see the problem if a *custom* convention became widely accepted and reused by other people

Then I think you don't agree with the fundamental design principle of the data-* attribute. The theory is that extensions to HTML benefit from going through a community process like WHATWG or W3C, and blessing extension points encourages people to circumvent that process, with the result that browsers have to support poorly designed features in order to have an interoperable web.
Yet it is *possible* to use data-* attributes to define a proper *private* convention by choosing names carefully in order to avoid clashes with other private conventions (for instance, a widget might need metadata to be put within the host page, and a careful choice of data-* names might avoid clashes with other metadata needed by other widgets or by the page itself). More people might find a certain convention useful and reusable enough for their purposes (because of non-clashing names), and the result would be a clearer cow path that community cowboys might follow to catch the free problem running away from standards. The *only* difference with data-rdfa-* here would be that a higher number of authors/developers should agree with such a convention from the beginning, but only if they were interested in exchanging the same metadata with each other for their respective *custom* uses (through a custom script or plugin, either developed independently or shared). From this point of view, the only difference between data-rdfa-about and about - as used for the purposes of SearchMonkey - is that the former is immediately conforming to the HTML5 spec and, thus, surely exposed through the DOM by every possible HTML5-compliant UA, as happens for classes used by Microformats. I've never thought of any requirements for UAs not coming from a clearly traced cow path, the same way there is no requirement for UAs not involved in SearchMonkey to support any kind of metadata - for the purposes of SearchMonkey itself.
Unless one thinks that everyone facing a problem not solved (at all, or well enough for his purposes) by an official standard should either create a private hack disregarding any possible hacks for similar problems he might have happened to find on the web, or start a new community process, possibly without knowing if other people are facing the same problem or a similar one, I really can't understand why a convention that is *custom* and *born-private* (possibly within a group of authors/developers) and then becomes widely accepted should be a problem, as long as it is based on existing, standard features, doesn't require any additional support, and results in a possible cow path to be then standardized as needed. And I really don't understand why class=xyz is a good hack whereas data-some-thing is not, assuming both are designed for and used by cows opening a path ( :-P )

I really can't get, right now, why it should be different, for instance, from the case of a freely reusable widget using a custom data model based on private data-* attributes inserted by people in thousands of websites (the widget with relative metadata, I mean), then liked by other people and reused in different contexts (the same data model based on data-*, now)

Reuse of data-* by DHTML widgets would not impose any additional requirements on user agents, so it would be fine from the perspective elaborated above. It wouldn't change the language by the back door.

Really? Is it so much different from the case of the pattern attribute (which addresses, at the UA and language level, a problem earlier solved by scripts -- e.g. getting elements by their
Re: [whatwg] Trying to work out the problems solved by RDFa
On 2009-01-12 23:15, Toby A Inkster wrote:

Henri Sivonen wrote:

eRDF is very different in not relying on attributes whose qname contains the substring xmlns.

eRDF is very different in that it is incredibly annoying to use in real world scenarios (i.e. not hypothetical Hello World examples).

Calogero Alex Baldacchino wrote:

I guess closing a language to every kind of back-door changes may be in contrast with the principle of paving a cowpath. I also guess that, if microformats experience (or the real-world semantics they claim to be based on) had suggested the need to add a new element/attribute to the language, a new element/attribute would have been added.

But Microformats experience *does* suggest that new attributes are needed for semantics. Look at the debate around accessibility within Microformats which has been going on for ages. Because of the Microformats process of working *within* existing HTML standards it has not been solved, and I can't see a solution reaching consensus in the foreseeable future. HTML5's <time> goes part of the way to solving this, but it doesn't address the whole problem like RDFa's content attribute does.

Right, so some microformats brought to attention a need which HTML5 could easily solve by adding <time>. Why does this mean that RDFa should be added?

Another reason the Microformat experience suggests new attributes are needed for semantics is the overloading of an attribute (class) previously mainly used for private convention so that it is now used for public consumption.

But HTML4 itself says that class can be used for general purpose processing by user agents, so this seems to be a weird argument. If we introduced RDFa and it got used, would you argue you need something more than RDFa, because it is being used for what it is specced for?

Yes, in real life, there are pages that use class=vcard for things other than encoding hCard. (They mostly use it for linking to VCF files.) Incredibly, I've even come across pages that use class=vcard for non-hCard uses, *and* hCard - yes, on the same page! As the Microformat/POSHformat space becomes more crowded, accidental collisions in class names become ever more likely.

Right, but is it much of an issue? If you have a hCard extractor, the user can see easily that it's not useful data. And if it doesn't follow any of the other rules for an hCard, then the UA can safely ignore it (e.g. it has no fields). In practice, this kind of collision seems fairly non-problematic.

The Microformats community hasn't added any new attributes for Microformats, because that was one of the guiding principles when the community was established: however, that does not mean it hasn't shown that new attributes are needed for encoding rich semantics in HTML. On the contrary, I think it's proved that they are.

Given that the only example of the microformats process needing an addition to the HTML language has been <time>, I'm not sure that's a conclusive proof.

Andi