Re: [whatwg] Content type sniffing

2009-01-12 Thread Adam Barth
I should say that these figures are weighted by the number of page
loads, so if sniffing for a particular tag is needed for the digg.com
home page, it will show up as a large number.  If you don't weight by
traffic, you get similar results, but with slightly different numbers.

Adam


On Sun, Jan 11, 2009 at 11:54 PM, Adam Barth wha...@adambarth.com wrote:
 On Sun, Jan 11, 2009 at 6:41 PM, Boris Zbarsky bzbar...@mit.edu wrote:
 I just noticed that section 2.7.1 of HTML5 says:

  Extensions must not be used for determining resource types
  for resources fetched over HTTP.

 Extensions are bad news for content sniffing because they can often be
 chosen by the attacker.  For example, suppose user-uploaded content
 can be downloaded at:

 http://example.com/download.php

 In most PHP configurations, the attacker can choose whatever file
 extension he likes by directing the user's browser to:

 http://example.com/download.php/whatever.foo

 And the PHP script will happily run.
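
 A minimal sketch of why this matters, assuming a sniffer that reads the
 extension from the last path segment of the URL. The helper name
 url_extension is hypothetical and is not taken from any browser:

```python
import posixpath
from urllib.parse import urlsplit


def url_extension(url: str) -> str:
    """Return the extension of the last path segment, or '' if there is none."""
    path = urlsplit(url).path
    return posixpath.splitext(posixpath.basename(path))[1]


# The same script serves both URLs, but the apparent extension differs.
print(url_extension("http://example.com/download.php"))               # .php
print(url_extension("http://example.com/download.php/whatever.foo"))  # .foo
```

 Whoever builds the link picks the trailing path segment, so an
 extension read this way is attacker-controlled input.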

 Now this use case (no content-type at all) was pretty common when the
 unknown type sniffer in Gecko was written, but that was years ago.  Do we
 have any data on how common it is now?

 Yes.  We do have lots of data from opt-in user metrics from Chrome.
 Here is a somewhat recent summary:

 https://crypto.stanford.edu/~abarth/research/html5/content-sniffing/

 To address your particular concern, <body> occurs 6899 times less often
 than <script> on Web content that lacks a Content-Type (or has a bogus
 Content-Type like */*), assuming I did my arithmetic correctly.

 P.S.  Of course at the moment the sniffer in Gecko is used for more than
 just HTTP, and it looks like we'll need separate modes for things like HTTP
 and things like file://.  I can live with that, though.  For the file://
 case detection of HTML in documents with no doctype/html/head is a must.

 I'm sympathetic to adding more HTML tags to the list, but I'm not sure
 how far down the tail we should go.  In Chrome, we went for 99.999%
 compatibility, which might be a bit far down the tail.  You can see
 the algorithm here:

 http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?view=markup

 Using that figure, we went down to <p> (which is two tags less common
 than <body>).

 Adam



Re: [whatwg] Fuzzbot (Firefox RDFa semantics processor)

2009-01-12 Thread Julian Reschke

Martin Atkins wrote:

...
If it is true that RDFa can work today with no ill-effect in downlevel 
user-agents, what's currently blocking its implementation? Concern for 
validation?


It seems to me that many HTML extensions are implemented first and 
specified later[1], so perhaps it would be in the interests of RDFa 
proponents to get some implementations out there and get RDFa adopted, 
at which point it will hopefully seem a much more useful proposition for 
inclusion in HTML5.


In the short term the RDFa community can presumably provide a 
specialized HTML5 + RDFa validator for adopters to use until RDFa is 
incorporated into the core spec and tools.


It would seem that it's much easier to get into the spec when your 
feature is proven to be useful by real-world adoption.

...


What he said.

Although I *do* believe that in the end we'll want RDFa-in-HTML5, what's 
really important right now is *not* RDFa-in-HTML5 but RDFa-in-HTML4. 
Define that, make it a success, and the rest will be simple.


Best regards, Julian


[whatwg] data-* [Was:Re: Trying to work out the problems solved by RDFa]

2009-01-12 Thread James Graham

Benjamin Hawkes-Lewis wrote:

On 11/1/09 16:52, Calogero Alex Baldacchino wrote:


Well, that's a chance, of course, but that's *not* RDFa as specified by
W3C; for instance, @property is specified as accepting _only_ CURIEs


Good point; I hadn't spotted that.


It's the same with every possible existing custom (non-standard)
attributes and elements out there, since there is no standard for them,
and instead data-* has been created;


Emphatically, data-* has been created for private use data encoding 
(basically for scripting purposes) not as a replacement for the existing 
practices of adding new elements and attributes to HTML without going 
through W3C/WHATWG.


It should perhaps set alarm bells ringing that almost every time data-* 
attributes come up, people suggest using them to publish data to the web 
at large rather than as internal scripting hooks. Since the restrictions 
on data-* are not machine checkable, even the majority of 
standards-aware authors are unlikely to heed them. Therefore the net effect of 
the restriction will be to prevent conscientious standards bodies from 
using data-* attributes in their specifications. It is quite possible 
that popular technologies will arise from sources other than such 
standards organisations and so use of data-* for more than just private 
scripting may be inevitable.


It is also possible that features that start off as private scripting 
hooks will evolve into data publishing features. This again would lead 
to the natural breaking of the restriction of data-* attributes.


(I know I have said this before but I forget whether I posted it or just 
discussed it on IRC.)




Re: [whatwg] data-* [Was:Re: Trying to work out the problems solved by RDFa]

2009-01-12 Thread Julian Reschke

James Graham wrote:
It should perhaps set alarm bells ringing that almost every time data-* 
attributes come up, people suggest using them to publish data to the web 
at large rather than as internal scripting hooks. Since the restrictions 
on data-* are not machine checkable, even the majority of 
standards-aware authors are unlikely to heed them. Therefore the net effect of 
the restriction will be to prevent conscientious standards bodies from 
using data-* attributes in their specifications. It is quite possible 
that popular technologies will arise from sources other than such 
standards organisations and so use of data-* for more than just private 
scripting may be inevitable.


It is also possible that features that start off as private scripting 
hooks will evolve into data publishing features. This again would lead 
to the natural breaking of the restriction of data-* attributes.


(I know I have said this before but I forget whether I posted it or just 
discussed it on IRC.)


Agreed.

So what does this tell us about the point of view that distributed 
extensibility should not be supported by HTML5?


Best regards, Julian





Re: [whatwg] getElementsByClassName case sensitivity

2009-01-12 Thread Stewart Brodie
Ian Hickson i...@hixie.ch wrote (on 25 July 2008):

 I've made [getElementsByClassName] consistent with how classes work in CSS
 (case-insensitive for quirks and case-sensitive otherwise).

I was looking for some tests for this API and found some from Opera (at
http://tc.labs.opera.com/apis/getElementsByClassName/), but given that the
dates on them predate the latest spec changes (which causes some to fail
now), I was wondering whether up-to-date versions are now kept somewhere
else?
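
For reference, the rule quoted above can be sketched as follows. This is
only an illustration: the function names are hypothetical, and treating
"case-insensitive" as ASCII-only folding is an assumption, not something
stated in the quoted text.

```python
def ascii_lower(s: str) -> str:
    """Lowercase A-Z only, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def has_class(class_attr: str, wanted: str, quirks_mode: bool) -> bool:
    """Check whether a class attribute value contains the wanted class name."""
    tokens = class_attr.split()  # class names are whitespace-separated
    if quirks_mode:
        # Quirks mode: compare case-insensitively (ASCII folding assumed).
        return ascii_lower(wanted) in [ascii_lower(t) for t in tokens]
    # Otherwise: exact, case-sensitive comparison.
    return wanted in tokens
```

A test suite for the API would need cases on both sides of the switch,
for example class="Foo" matched by "foo" only when the document is in
quirks mode.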

-- 
Stewart Brodie
Software Engineer
ANT Software Limited


Re: [whatwg] Fuzzbot (Firefox RDFa semantics processor)

2009-01-12 Thread Toby A Inkster

Martin Atkins wrote:


  * Some sites are already publishing XFN and/or hCard so consuming
software would need to continue to support these in addition to
FOAF-in-HTML-somehow, which is more work than supporting only XFN and
hCard.


Mitigating this though is GRDDL which allows the hCard+XFN to be  
parsed using a subset of FOAF (e.g. http://weborganics.co.uk/hFoaF/)  
and thus merged with FOAF available as RDF/XML, RDFa, etc.


--
Toby A Inkster
mailto:m...@tobyinkster.co.uk
http://tobyinkster.co.uk





Re: [whatwg] Content type sniffing

2009-01-12 Thread Boris Zbarsky

Adam Barth wrote:

Extensions are bad news for content sniffing because they can often be
chosen by the attacker.  For example, suppose user-uploaded content
can be downloaded at:

http://example.com/download.php

In most PHP configurations, the attacker can choose whatever file
extension he likes by directing the user's browser to:

http://example.com/download.php/whatever.foo

And the PHP script will happily run.


Right, I understand that.


Yes.  We do have lots of data from opt-in user metrics from Chrome.
Here is a somewhat recent summary:

https://crypto.stanford.edu/~abarth/research/html5/content-sniffing/


I'm not quite sure what to make of this, actually.  Specifically, where 
is the 22.19% number for HTML Tags coming from?  22.19% of what? 
The magic numbers stuff actually adds up to 100%, but of what?



To address your particular concern, <body> occurs 6899 times less often
than <script> on Web content that lacks a Content-Type (or has a bogus
Content-Type like */*), assuming I did my arithmetic correctly.


OK, that's good to know.


I'm sympathetic to adding more HTML tags to the list, but I'm not sure
how far down the tail we should go.  In Chrome, we went for 99.999%
compatibility, which might be a bit far down the tail.


Doesn't seem that way to me, given the number of web pages out there.


http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?view=markup


Ah, ok.  The relevant Gecko code is 
http://hg.mozilla.org/mozilla-central/annotate/9f82199fdb9c/netwerk/streamconv/converters/nsUnknownDecoder.cpp#l477. 
I'd probably be fine with trimming that list down a bit, but I'm not 
quite sure what the downsides of having more tags in it are here.


-Boris


Re: [whatwg] Content type sniffing

2009-01-12 Thread Adam Barth
On Mon, Jan 12, 2009 at 7:54 AM, Boris Zbarsky bzbar...@mit.edu wrote:
 I'm not quite sure what to make of this, actually.  Specifically, where is
 the 22.19% number for HTML Tags coming from?  22.19% of what? The magic
 numbers stuff actually adds up to 100%, but of what?

Sorry, the % signs were confusing.  I've removed them.  These tables give
the relative frequency with which those rules fire in the content sniffer.
I probably should have scaled them all to be out of 100 or out of 1,
but it was more convenient to scale them out of the totals that I did.

 I'm sympathetic to adding more HTML tags to the list, but I'm not sure
 how far down the tail we should go.  In Chrome, we went for 99.999%
 compatibility, which might be a bit far down the tail.

 Doesn't seem that way to me, given the number of web pages out there.

I don't think it makes sense to compare that percentage to the number
of web pages.  Instead, imagine a user who views 100 pages a day.
That user will, in a crude average sense, come across a broken web
page once every three years.
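
The arithmetic behind that estimate, as a small sketch. It assumes the
99.999% figure is read as a per-page-load rate, i.e. one failure in
every 100,000 page loads; the variable names are illustrative only.

```python
PAGES_PER_DAY = 100
LOADS_PER_BROKEN_PAGE = 100_000  # 99.999% compatible -> 1 failure per 100,000 loads

days = LOADS_PER_BROKEN_PAGE / PAGES_PER_DAY  # days between broken pages
years = days / 365

print(days, round(years, 2))  # 1000.0 2.74
```

That is 1000 days, or a bit under three years.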

 http://src.chromium.org/viewvc/chrome/trunk/src/net/base/mime_sniffer.cc?view=markup

 Ah, ok.  The relevant Gecko code is
 http://hg.mozilla.org/mozilla-central/annotate/9f82199fdb9c/netwerk/streamconv/converters/nsUnknownDecoder.cpp#l477.

Yes, I've examined that code in detail.  :)  Here is a web page that
will let you compare the sniffing algorithms used by four popular
browsers:

http://webblaze.cs.berkeley.edu/2009/content-sniffing/

 I'd probably be fine with trimming that list down a bit, but I'm not quite
 sure what the downsides of having more tags in it are here.

Most of the cost is complexity (which leads to security
vulnerabilities).  People who let users upload content and who build
firewalls that filter content at the application layer (for example,
to look for malware) need to understand browser content sniffing
algorithms in order to build secure products.  There is a huge
complexity win for standardizing the algorithm across multiple
implementations, and there is a small complexity loss for each
sniffing heuristic we add.
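
To make the per-heuristic cost concrete, here is a rough sketch of a
tag-based HTML sniffer. It is not the Chrome or Gecko algorithm: the tag
list, the 512-byte window and the function name are assumptions chosen
for illustration.

```python
# Hypothetical tag list; each entry is one more heuristic to reason about.
HTML_TAGS = (b"<!doctype html", b"<html", b"<head", b"<script", b"<body", b"<p")
DELIMITERS = (b" ", b"\t", b"\r", b"\n", b">")


def looks_like_html(data: bytes, limit: int = 512) -> bool:
    """Guess whether a response body is HTML from its leading bytes."""
    head = data[:limit].lstrip(b" \t\r\n").lower()
    for tag in HTML_TAGS:
        if head.startswith(tag):
            # Require a delimiter so "<p" does not match "<pre" or "<param".
            following = head[len(tag):len(tag) + 1]
            if following in DELIMITERS:
                return True
    return False
```

Every tag added to the list is one more prefix that an upload filter or
application-layer firewall has to reject before it can treat a file as
safe to serve.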

One plan for going forward is to resolve
https://bugzilla.mozilla.org/show_bug.cgi?id=465007 and then open
another bug for harmonizing the HTML heuristic (with the expectation
that harmonization will probably involve changing both the spec and
the implementation).

Adam


Re: [whatwg] Fuzzbot (Firefox RDFa semantics processor)

2009-01-12 Thread Henri Sivonen

On Jan 11, 2009, at 14:01, Toby A Inkster wrote:

RDFa *does not* rely on XML namespaces. RDFa relies on eight  
attributes: about, rel, rev, property, datatype, content, resource  
and typeof. It also relies on a CURIE prefix binding mechanism. In  
XHTML and SVG, RDFa happens to use XML namespaces as this mechanism,  
because they already existed and they were convenient.


Convenience is debatable. In any case, it is rather disingenuous to  
say that RDFa doesn't rely on XML Namespaces when all that has been  
defined so far relies on attributes whose qname contains the substring  
xmlns.


In non-XML markup languages, the route to define CURIE prefixes is  
still to be decided, though discussions tend to be leaning towards  
something like:


<html prefix="dc=http://purl.org/dc/terms/ foaf=http://xmlns.com/foaf/0.1/">

<address rel="foaf:maker" rev="foaf:made">This document was made by
<a href="http://joe.example.com" typeof="foaf:Person"
rel="foaf:homepage" property="foaf:name">Joe Bloggs</a>.</address>

</html>
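
As a rough sketch of how a consumer might resolve CURIEs under a prefix
attribute of this shape: the attribute syntax is the undecided proposal
above, and the function names are hypothetical, so none of this reflects
a specified algorithm.

```python
from typing import Dict, Optional


def parse_prefixes(attr: str) -> Dict[str, str]:
    """Parse 'name=IRI name=IRI ...' into a prefix-to-IRI mapping."""
    mapping: Dict[str, str] = {}
    for token in attr.split():
        name, sep, iri = token.partition("=")
        if sep and name and iri:
            mapping[name] = iri
    return mapping


def expand_curie(curie: str, prefixes: Dict[str, str]) -> Optional[str]:
    """Expand 'prefix:reference' to a full IRI, or None if the prefix is unbound."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        return None
    return prefixes[prefix] + reference


prefixes = parse_prefixes(
    "dc=http://purl.org/dc/terms/ foaf=http://xmlns.com/foaf/0.1/"
)
print(expand_curie("foaf:maker", prefixes))  # http://xmlns.com/foaf/0.1/maker
```
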


Unless this syntax were also used for XHTML, the above would be in  
violation of the DOM Consistency Design Principle of the W3C HTML WG.


This discussion seems to be about "should/can RDFa work in HTML5?"  
when in fact, RDFa already can and does work in HTML5 - there are  
approaching a dozen interoperable implementations of RDFa, the  
majority of which seem to handle non-XHTML HTML.


Those implementations violate the software implementation reuse  
principle that motivates the DOM Consistency Design Principle. (The  
software reuse principle being that the same code path be used for  
both HTML and XHTML on layers higher than the parser.)


The prefix mapping mechanism of CURIEs has been designed with  
disregard towards this software reuse principle (in use in Gecko,  
WebKit and, I gather, Presto) that should have been known to anyone  
working on Web-related specs long before DOM Consistency was written  
into the Design Principles of the HTML WG.


--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/




Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-12 Thread Henri Sivonen

On Jan 11, 2009, at 18:52, Calogero Alex Baldacchino wrote:

However, actually it's the same for RDFa attributes, because they're  
not in the spec. From this point of view, introducing six new  
attributes, or resorting to an older one is not very different, thus  
(again) why RDFa and not eRDF?



eRDF is very different in not relying on attributes whose qname  
contains the substring xmlns.


--
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/




[whatwg] <code> in "in body" insertion mode (8.2.5)

2009-01-12 Thread Kartikaya Gupta
<code> is listed in the "formatting" category of elements, but isn't dealt with 
in the same way as other formatting elements when in the "in body" insertion 
mode. Currently it will fall through to the "any other start tag" case, so the 
note in that case that says "This element will be a phrasing element" is 
incorrect.

I'm assuming that the <code> element should be listed along with the other 
formatting elements (b, big, em, etc.) for the "in body" insertion mode. Is 
that correct?

kats


Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-12 Thread Calogero Alex Baldacchino

Benjamin Hawkes-Lewis wrote:



After all, support for unknown attributes/elements has never been a
standard de jure, but more of a quirk


Depends what you mean by "support" I guess.



I just mean that, as far as I know, there is no official standard 
requiring UAs to support (parse and expose through the DOM) attributes 
and elements which are not part of the HTML language but are found in 
text/html documents. Usually, browsers support them for robustness' sake 
and/or backward compatibility with existing pages, but they might do it 
with significant differences. (Actually that happens for unknown elements 
but not for unknown attributes, but one shouldn't assume that such common 
behavior will not change in the future, or that it will be adopted by 
newer vendors, even if that might be quite a safe assumption.) Thus any 
hack to the language /for custom purposes and script elaboration/ should 
be done by means of existing attributes/elements instead of creating 
new ones. I mean, data-rdfa-about might be a bit safer than just 
"about", to use a conservative approach based on the assumption "I know 
what happens today, not what will happen tomorrow". Before data-* it 
was possible through the class attribute; now data-* can also be used 
for custom hacks.



I really don't see the problem if a *custom* convention became widely
accepted and reused by other people


Then I think you don't agree with the fundamental design principle 
of the data-* attribute. The theory is that extensions to HTML 
benefit from going through a community process like WHATWG or W3C, and 
blessing extension points encourages people to circumvent that 
process, with the result that browsers have to support poorly designed 
features in order to have an interoperable web.




Yet it is *possible* to use data-* attributes to define a proper 
*private* convention by choosing names carefully in order to avoid 
clashes with other private conventions (for instance, a widget might 
need metadata to be put within the host page, and a careful choice of 
data-* names might avoid clashes with other metadata needed by other 
widgets or by the page itself). More people might find a certain 
convention useful and reusable enough for their purposes (because of 
non-clashing names), and the result would be a clearer cowpath that 
community cowboys might follow to catch the free problem running away 
from standards.


The *only* difference with data-rdfa-* here would be that a higher 
number of authors/developers should agree with such a convention from 
the beginning, but only if they were interested in exchanging the same 
metadata with each other for their respective *custom* uses (through a 
custom script or plugin, either developed independently or shared). From 
this point of view, the only difference between data-rdfa-about and 
"about" - as used for the purposes of SearchMonkey - is that the former 
immediately conforms to the HTML5 spec and is thus surely exposed 
through the DOM by every possible HTML5-compliant UA, as happens for 
classes used by Microformats. I've never thought of any requirements for 
UAs not coming from a clearly traced "cowpath", the same way there is 
no requirement for UAs not involved in SearchMonkey to support any kind 
of metadata - for the purposes of SearchMonkey itself.


Unless one thinks that everyone facing a problem not solved (at all, or 
well enough for his purposes) by an official standard should either 
create a private hack, disregarding any possible hacks for similar 
problems he might have happened to find on the web, or start a new 
community process, possibly without knowing whether other people are 
facing the same problem or a similar one, I really can't understand why 
a *custom* and *born-private* convention (possibly born within a group 
of authors/developers) that then becomes widely accepted should be a 
problem, as long as it is based on existing, standard features, doesn't 
require any additional support, and results in a possible cowpath to be 
standardized later as needed. And I really don't understand why 
class=xyz is a good hack whereas data-some-thing is not, assuming both 
are designed for and used by cows opening a path ( :-P )



I really can't get, right now, why it should be different, for instance,
from the case of a freely reusable widget using a custom data model
based on private data-* attributes inserted by people in thousands of
websites (the widget with relative metadata, I mean), then liked by
other people and reused in different contexts (the same data model based
on data-*, now)


Reuse of data-* by DHTML widgets would not impose any additional 
requirements on user agents, so it would be fine from the perspective 
elaborated above. It wouldn't change the language by the back door.


Really? Is it so much different from the case of the pattern attribute 
(which addresses, at the UA and language level, a problem earlier solved 
by scripts -- e.g. getting elements by their 

Re: [whatwg] Trying to work out the problems solved by RDFa

2009-01-12 Thread Andi Sidwell

On 2009-01-12 23:15, Toby A Inkster wrote:

Henri Sivonen wrote:


eRDF is very different in not relying on attributes whose qname
contains the substring xmlns.



eRDF is very different in that it is incredibly annoying to use in real
world scenarios (i.e. not hypothetical Hello World examples).

Calogero Alex Baldacchino wrote:


I guess closing a language to every kind of back-door change may be
in contrast with the principle of paving a cowpath. I also guess that,
if microformats experience (or the real-world semantics they claim to
be based on) had suggested the need to add a new element/attribute to
the language, a new element/attribute would have been added.


But Microformats experience *does* suggest that new attributes are
needed for semantics. Look at the debate around accessibility within
Microformats which has been going on for ages. Because of the
Microformats process of working *within* existing HTML standards it has
not been solved, and I can't see a solution reaching consensus in the
foreseeable future. HTML5's <time> goes part of the way to solving this,
but it doesn't address the whole problem like RDFa's content attribute
does.


Right, so some microformats brought to attention a need which HTML5 
could easily solve by adding <time>.  Why does this mean that RDFa 
should be added?



Another reason the Microformat experience suggests new attributes are
needed for semantics is the overloading of an attribute (class)
previously mainly used for private convention so that it is now used for
public consumption.


But HTML4 itself says that class can be used for general purpose 
processing by user agents, so this seems to be a weird argument.  If we 
introduced RDFa and it got used, would you argue you need something more 
than RDFa, because it is being used for what it is specced for?



Yes, in real life, there are pages that use
class=vcard for things other than encoding hCard. (They mostly use it
for linking to VCF files.) Incredibly, I've even come across pages that
use class=vcard for non-hCard uses, *and* hCard - yes, on the same
page! As the Microformat/POSHformat space becomes more crowded,
accidental collisions in class names become ever more likely.


Right, but is it much of an issue?  If you have a hCard extractor, the 
user can see easily that it's not useful data.  And if it doesn't follow 
any of the other rules for an hCard, then the UA can safely ignore it 
(e.g. it has no fields).  In practice, this kind of collision seems 
fairly non-problematic.



The Microformats community hasn't added any new attributes for
Microformats, because that was one of the guiding principles when the
community was established: however, that does not mean it hasn't shown
that new attributes are needed for encoding rich semantics in HTML. On
the contrary, I think it's proved that they are.


Given that the only example of the microformats process needing an 
addition to the HTML language has been <time>, I'm not sure that's a 
conclusive proof.


Andi