Re: [whatwg] Spec comments, sections 1-2
On Wed, 05 Aug 2009 02:01:59 +0200, Ian Hickson <i...@hixie.ch> wrote:
> I'm pretty sure that character encoding support in browsers is more of
> a "collect them all" kind of thing than really based on content that
> requires it, to be honest.

Really? I think a lot of them are actually used. If you know of any we can drop, I'd love to trim the number of encodings the Web needs to a smaller list than what we currently ship with. Ideally this would become a fixed list across all Web languages.

> If someone can provide a firm list of encodings that they are confident
> are required for a certain substantial percentage of the Web, I'm happy
> to add the list to the spec.

Can you not do a survey on your large dataset to find this out? I also read somewhere that Adam Barth was able to add code to Google Chrome to figure out a better algorithm for Content-Type sniffing. Maybe something similar could be done here?

By the way, we've encountered problems with using the Unicode encoding matching algorithm, particularly on some Asian sites. I think we need to switch HTML5 back to something more akin to what WebKit/Gecko/Trident do. I realize this means more magic lists, but the current algorithm does not seem to cut it. E.g. sites rely on the fact that "EUC_JP" is not a recognized encoding label but "EUC-JP" is.

-- 
Anne van Kesteren
http://annevankesteren.nl/
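[Editor's note: Anne's EUC_JP point can be illustrated with a small sketch. The `loose_key` function below is a simplification of UTS #22-style label matching (the real algorithm also handles leading zeros in numeric runs), and the `RECOGNIZED` set is a tiny stand-in for a browser's fixed label list, not an actual browser table.]

```python
import re

# Simplified UTS #22-style loose matching: fold case and drop
# everything that is not a letter or digit before comparing labels.
def loose_key(label):
    return re.sub(r'[^a-z0-9]', '', label.lower())

# Strict matching against a fixed list of recognized labels, as the
# browsers of the time effectively did. (Tiny illustrative set only.)
RECOGNIZED = {'euc-jp', 'utf-8', 'windows-1252'}

def strict_match(label):
    return label.lower() in RECOGNIZED

# Under loose matching, the misspelled label still resolves...
assert loose_key('EUC_JP') == loose_key('EUC-JP')
# ...but under strict matching it does not, so a page declaring
# charset=EUC_JP falls back to the default encoding instead --
# the behavior Anne says sites depend on.
assert not strict_match('EUC_JP')
assert strict_match('EUC-JP')
```

Under the loose algorithm a page declaring `charset=EUC_JP` would suddenly be decoded as EUC-JP, which is exactly the compatibility change Anne reports breaking sites.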
Re: [whatwg] Spec comments, sections 1-2
On Wed, 29 Jul 2009, Aryeh Gregor wrote:
> On Wed, Jul 29, 2009 at 4:39 AM, Ian Hickson <i...@hixie.ch> wrote:
> > Which others are needed for compatibility?
>
> I don't know, but there are certainly some. Otherwise, why would
> browsers support so many?

I'm pretty sure that character encoding support in browsers is more of a "collect them all" kind of thing than really based on content that requires it, to be honest.

> For instance, baidu.com is #9 on Alexa and serves gb2312 as far as I
> can tell. So does qq.com, which is #14. And sina.com.cn, #19.
> vkontakte.ru is #30 and serves Windows-1251. tudou.com (#60) uses gbk.
> rakuten.co.jp (#68) serves EUC-JP. This is just from a quick manual
> look at a few of the largest non-English sites.
>
> I'd think it would be fairly easy for someone (e.g., Google) to come up
> with a rough summary of character encoding usage on the web by
> percentage, and for vendors to say which encodings they support, so a
> useful common list could be worked out. If browsers differ in which
> encodings they accept, that harms interoperability, so I'd think it
> would be ideal if HTML 5 specified the exact list of encodings that
> must be supported and prohibited support for any others. The union of
> encodings supported by existing browsers would be a reasonable start,
> since supporting a new encoding is presumably pretty cheap. Unless this
> is viewed as outside the scope of HTML 5 -- e.g., if browsers tend to
> rely on the operating system for encoding support.

If someone can provide a firm list of encodings that they are confident are required for a certain substantial percentage of the Web, I'm happy to add the list to the spec.

-- 
Ian Hickson               U+1047E                )\._.,--,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \  _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Re: [whatwg] Spec comments, sections 1-2
On Wed, 15 Jul 2009, Aryeh Gregor wrote:
> In 2.4.4.1: "If position is not past the end of input, return to the
> top of the step labeled loop in the overall algorithm (that's the step
> within which these substeps find themselves)." Why not just go to
> step 9?

Which part specifically are you saying should be changed? I'm not sure I follow.

> In any event this is inconsistent with 2.4.4.2, which says "If position
> is not past the end of input, return to the top of step 9 in the
> overall algorithm (that's the step within which these substeps find
> themselves)."

I guess I didn't screw that one up yet. :-) I originally wrote algorithms using numeric step jumps, then when editing them I broke a bunch of jumps (adding steps but not updating numbers), and so whenever I edit an algorithm now, I make it symbolic rather than numeric.

> Either both should say "the top of step 9" or both should say "the top
> of the step labeled loop".

There is value in not changing them unless they are actually broken -- when I edit the spec, there's always a risk I'll break something.

> I don't see the value in the whole "in the overall algorithm..." part,
> since in context there's no ambiguity with just giving the number.

For now -- what if I later add a dozen more substeps?

> "If sign is positive, return value, otherwise return 0-value." I
> initially read "0-value" as a single word, like "p-value" or whatever.
> Perhaps it should have spaces to make it more immediately obvious that
> it's subtraction (0 - value).

I've rephrased this.

> In 2.6.2: The specification says that user agents may serve HTTPS
> content as though it were unencrypted. For instance, an example states:
> "If a user connects to a server with a self-signed certificate, the
> user agent could allow the connection but just act as if there had been
> no encryption." If this is done, however, man-in-the-middle attacks
> become trivial, unless the user is expected to notice the lack of
> encryption (unlikely).
>
> For instance, suppose a user navigates to PayPal and bookmarks it.
> PayPal is configured so that if you try using HTTP (e.g., typing
> paypal.com in the URL bar), it will redirect to HTTPS. Therefore the
> user will bookmark a URL such as https://www.paypal.com/. Now suppose
> the user later attempts to access this site from the bookmark with a
> MITM present (e.g., a free wireless router placed in a public place by
> a malicious person). The router can intercept the HTTPS request, make
> its own identical HTTPS request, and return the results to the original
> requester, but signed with its own key instead of the original's.
>
> If the user agent behaves as described in the example, the only way for
> the user to notice this is to notice that the URL bar looks different,
> or whatever visual cue the browser uses. If the user agent raises a
> prominent scary warning, or even makes it difficult for the user to
> continue, on the other hand, there's no way for the attacker to prevent
> this, AFAIK.
>
> The section should prohibit user agents from displaying self-signed
> pages without at least giving a warning. Or, at a minimum, it should
> strongly discourage it. Currently it seems to indicate that this
> behavior is acceptable. As far as I know, existing browsers all present
> scary warnings for self-signed pages (probably so scary as to be
> misleading, in fact, but that's a separate issue).

I've required UAs to catch this case and added this example.

> In 2.7: "User agents must at a minimum support the UTF-8 and
> Windows-1252 encodings, but may support more. It is not unusual for Web
> browsers to support dozens if not upwards of a hundred distinct
> character encodings." Why aren't the most important ones listed as
> requirements?

They are. UTF-8 and 1252 are the most important ones.

> This seems to be contrary to the usual HTML 5 philosophy of mandating
> (or at least precisely specifying) existing behavior that's required
> for compatibility.

Which others are needed for compatibility?

-- 
Ian Hickson
http://ln.hixie.ch/
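[Editor's note: for context, the "0-value" step sits at the end of the spec's integer-parsing algorithm (2.4.4.1). The sketch below is a simplified, non-normative Python rendering of that algorithm, not the spec text itself.]

```python
def parse_integer(s):
    # Simplified sketch of the HTML "rules for parsing integers":
    # skip leading whitespace, read an optional sign, then digits;
    # trailing garbage is ignored. Returns None on error.
    pos = 0
    sign = 1
    while pos < len(s) and s[pos] in ' \t\n\f\r':
        pos += 1
    if pos < len(s) and s[pos] in '+-':
        if s[pos] == '-':
            sign = -1
        pos += 1
    start = pos
    while pos < len(s) and s[pos].isdigit():
        pos += 1
    if pos == start:
        return None  # no digits found
    value = int(s[start:pos])
    # The step discussed above: "if sign is positive, return value,
    # otherwise return 0 - value" -- i.e., subtraction, not a hyphen.
    return value if sign == 1 else 0 - value
```

So `parse_integer('  +7px')` yields 7 and `parse_integer('-42')` yields -42, with the sign applied only in that final step.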
Re: [whatwg] Spec comments, sections 1-2
On Wed, Jul 29, 2009 at 4:39 AM, Ian Hickson <i...@hixie.ch> wrote:
> There is value in not changing them unless they are actually broken --
> when I edit the spec, there's always a risk I'll break something.

Okay, not a big deal then.

> I've required UAs to catch this case and added this example.

Okay, great.

> Which others are needed for compatibility?

I don't know, but there are certainly some. Otherwise, why would browsers support so many?

For instance, baidu.com is #9 on Alexa and serves gb2312 as far as I can tell. So does qq.com, which is #14. And sina.com.cn, #19. vkontakte.ru is #30 and serves Windows-1251. tudou.com (#60) uses gbk. rakuten.co.jp (#68) serves EUC-JP. This is just from a quick manual look at a few of the largest non-English sites.

I'd think it would be fairly easy for someone (e.g., Google) to come up with a rough summary of character encoding usage on the web by percentage, and for vendors to say which encodings they support, so a useful common list could be worked out. If browsers differ in which encodings they accept, that harms interoperability, so I'd think it would be ideal if HTML 5 specified the exact list of encodings that must be supported and prohibited support for any others. The union of encodings supported by existing browsers would be a reasonable start, since supporting a new encoding is presumably pretty cheap. Unless this is viewed as outside the scope of HTML 5 -- e.g., if browsers tend to rely on the operating system for encoding support.
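[Editor's note: the kind of survey Aryeh proposes amounts to tallying declared charsets over a crawl. A toy sketch of that tally; the `pages` sample data and the simple regex are purely illustrative, not a real crawler or a robust charset parser.]

```python
import re
from collections import Counter

# Hypothetical sample of crawled page bytes (illustrative only).
pages = [
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">',
    b'<meta charset="windows-1251">',
    b'<meta charset="UTF-8">',
    b'<meta charset="utf-8">',
]

def declared_charset(html):
    # Naive extraction of the declared charset label; real sniffing
    # (HTTP headers, BOMs, meta prescan) is far more involved.
    m = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', html)
    return m.group(1).decode('ascii').lower() if m else None

tally = Counter(declared_charset(p) for p in pages)
# Encodings could then be ranked by share of the sample, giving the
# "rough summary by percentage" suggested above.
```

A real survey would also have to account for pages whose declared label disagrees with their actual bytes, which is part of why the resulting list is hard to pin down.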
[whatwg] Spec comments, sections 1-2
In 2.4.4.1: "If position is not past the end of input, return to the top of the step labeled loop in the overall algorithm (that's the step within which these substeps find themselves)." Why not just go to step 9? In any event this is inconsistent with 2.4.4.2, which says "If position is not past the end of input, return to the top of step 9 in the overall algorithm (that's the step within which these substeps find themselves)." Either both should say "the top of step 9" or both should say "the top of the step labeled loop". I don't see the value in the whole "in the overall algorithm..." part, since in context there's no ambiguity with just giving the number.

"If sign is positive, return value, otherwise return 0-value." I initially read "0-value" as a single word, like "p-value" or whatever. Perhaps it should have spaces to make it more immediately obvious that it's subtraction (0 - value).

In 2.6.2: The specification says that user agents may serve HTTPS content as though it were unencrypted. For instance, an example states: "If a user connects to a server with a self-signed certificate, the user agent could allow the connection but just act as if there had been no encryption." If this is done, however, man-in-the-middle attacks become trivial, unless the user is expected to notice the lack of encryption (unlikely).

For instance, suppose a user navigates to PayPal and bookmarks it. PayPal is configured so that if you try using HTTP (e.g., typing paypal.com in the URL bar), it will redirect to HTTPS. Therefore the user will bookmark a URL such as https://www.paypal.com/. Now suppose the user later attempts to access this site from the bookmark with a MITM present (e.g., a free wireless router placed in a public place by a malicious person). The router can intercept the HTTPS request, make its own identical HTTPS request, and return the results to the original requester, but signed with its own key instead of the original's.

If the user agent behaves as described in the example, the only way for the user to notice this is to notice that the URL bar looks different, or whatever visual cue the browser uses. If the user agent raises a prominent scary warning, or even makes it difficult for the user to continue, on the other hand, there's no way for the attacker to prevent this, AFAIK.

The section should prohibit user agents from displaying self-signed pages without at least giving a warning. Or, at a minimum, it should strongly discourage it. Currently it seems to indicate that this behavior is acceptable. As far as I know, existing browsers all present scary warnings for self-signed pages (probably so scary as to be misleading, in fact, but that's a separate issue).

In 2.7: "User agents must at a minimum support the UTF-8 and Windows-1252 encodings, but may support more. It is not unusual for Web browsers to support dozens if not upwards of a hundred distinct character encodings." Why aren't the most important ones listed as requirements? This seems to be contrary to the usual HTML 5 philosophy of mandating (or at least precisely specifying) existing behavior that's required for compatibility.
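[Editor's note: the fail-closed behavior Aryeh argues for is what mainstream TLS libraries now default to. A small illustration using Python's ssl module as a stand-in for a user agent's TLS stack; no network access is needed for the sketch itself.]

```python
import ssl

# The secure default: chain verification and hostname checking are on.
# Connecting through this context to a server that presents a
# self-signed (or MITM-substituted) certificate raises
# ssl.SSLCertVerificationError (Python 3.7+) -- a hard failure the UA
# can surface prominently, rather than a silent downgrade.
ctx = ssl.create_default_context()
assert ctx.verify_mode == ssl.CERT_REQUIRED
assert ctx.check_hostname is True

# The dangerous behavior in the spec's old example corresponds to
# explicitly disabling both checks: the connection still "works", but
# any interposed router's certificate is accepted without complaint.
insecure = ssl.create_default_context()
insecure.check_hostname = False          # must be disabled first
insecure.verify_mode = ssl.CERT_NONE     # "act as if no encryption"
```

The point of the sketch is that accepting a self-signed certificate silently is not a neutral choice: it is equivalent to turning verification off entirely for that connection.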