Re: [whatwg] Spec comments, sections 1-2

2009-08-05 Thread Anne van Kesteren

On Wed, 05 Aug 2009 02:01:59 +0200, Ian Hickson i...@hixie.ch wrote:

I'm pretty sure that character encoding support in browsers is more of a
"collect them all" kind of thing than really based on content that
requires it, to be honest.


Really? I think a lot of them are actually used. If you know anything, I'd
love to trim the number of encodings the Web needs to a smaller list than
what we currently ship with. Ideally this would become a fixed list across
all Web languages.




If someone can provide a firm list of encodings that they are confident
are required for a certain substantial percentage of the Web, I'm happy  
to add the list to the spec.


Can you not run a survey over your large dataset to find this out? I
read somewhere that Adam Barth was able to add code to Google Chrome
to figure out a better algorithm for Content-Type sniffing. Maybe
something similar could be done here?



By the way, we've encountered problems with using the Unicode encoding
matching algorithm, particularly on some Asian sites. I think we need to
switch HTML5 back to something more akin to WebKit/Gecko/Trident. I
realize this means more magic lists, but the current algorithm does not
seem to cut it. E.g. some sites rely on the fact that "EUC_JP" is not a
recognized encoding but "EUC-JP" is.
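To illustrate the difference being described, here is a hypothetical sketch (Python, with a made-up alias table; not actual browser code) of loose label matching in the style of Unicode's charset-alias rules versus a strict table lookup:

```python
def loose_match(label: str) -> str:
    # Roughly what Unicode-style charset alias matching does:
    # ignore case and any non-alphanumeric characters.
    return "".join(ch for ch in label.lower() if ch.isalnum())

# Under loose matching these labels collapse to the same key,
# so "EUC_JP" gets treated as EUC-JP:
assert loose_match("EUC-JP") == loose_match("EUC_JP") == "eucjp"

# A strict lookup (closer to historical WebKit/Gecko/Trident behavior)
# only recognizes the exact aliases in its table, so "EUC_JP" stays
# unrecognized -- the behavior some sites depend on:
STRICT_ALIASES = {"euc-jp": "EUC-JP", "x-euc-jp": "EUC-JP"}
assert STRICT_ALIASES.get("EUC_JP".lower()) is None
```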



--
Anne van Kesteren
http://annevankesteren.nl/


Re: [whatwg] Spec comments, sections 1-2

2009-08-04 Thread Ian Hickson
On Wed, 29 Jul 2009, Aryeh Gregor wrote:
 On Wed, Jul 29, 2009 at 4:39 AM, Ian Hickson i...@hixie.ch wrote:
  
  Which others are needed for compatibility?
 
 I don't know, but there are certainly some.  Otherwise, why would 
 browsers support so many?

I'm pretty sure that character encoding support in browsers is more of a
"collect them all" kind of thing than really based on content that
requires it, to be honest.


 For instance, baidu.com is #9 on Alexa and serves gb2312 as far as I can 
 tell.  So does qq.com, which is #14. And sina.com.cn, #19.  
 vkontakte.ru is #30 and serves Windows-1251. tudou.com (#60) uses gbk.  
 rakuten.co.jp (#68) serves EUC-JP.
 
 This is just from a quick manual look at a few of the largest 
 non-English sites.  I'd think it would be fairly easy for someone (e.g., 
 Google) to come up with a rough summary of character encoding usage on 
 the web by percentage, and for vendors to say which encodings they 
 support, so a useful common list could be worked out.
 
 If browsers differ in which encodings they accept, that harms 
 interoperability, so I'd think it would be ideal if HTML 5 would specify 
 the exact list of encodings that must be supported and prohibited 
 support for any others.  The union of encodings supported by existing 
 browsers would be a reasonable start, since supporting a new encoding is 
 presumably pretty cheap.  Unless this is viewed as outside the scope of 
 HTML 5 -- e.g., if browsers tend to rely on the operating system for 
 encoding support.

If someone can provide a firm list of encodings that they are confident 
are required for a certain substantial percentage of the Web, I'm happy to 
add the list to the spec.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Spec comments, sections 1-2

2009-07-29 Thread Ian Hickson
On Wed, 15 Jul 2009, Aryeh Gregor wrote:

 In 2.4.4.1:
 
 If position is not past the end of input, return to the top of the
 step labeled "loop" in the overall algorithm (that's the step within
 which these substeps find themselves).
 
 Why not just go to step 9?

Which part specifically are you saying should be changed? I'm not sure I 
follow.


 In any event this is inconsistent with
 2.4.4.2, which says
 
 If position is not past the end of input, return to the top of step 9
 in the overall algorithm (that's the step within which these substeps
 find themselves).

I guess I didn't screw that one up yet. :-) I originally wrote algorithms 
using numeric step jumps, then when editing them I broke a bunch of jumps 
(adding steps but not updating numbers), and so whenever I edit an 
algorithm now, I make it symbolic rather than numeric.


 Either both should say "the top of step 9" or both should say "the top
 of the step labeled loop."

There is value in not changing them unless they are actually broken -- 
when I edit the spec, there's always a risk I'll break something.


 I don't see the value in the whole "in the overall algorithm . . ."
 part, since in context there's no ambiguity with just giving the number.

For now -- what if I later add a dozen more substeps?


 If sign is positive, return value, otherwise return 0-value.
 
 I initially read "0-value" as a single word, like "p-value" or
 whatever.  Perhaps it should have spaces to make it more immediately
 obvious that it's subtraction (0 - value).

I've rephrased this.


 In 2.6.2:
 
 The specification says that user agents may serve HTTPS content as
 though it were unencrypted.  For instance, an example states: "If a user
 connects to a server with a self-signed certificate, the user agent
 could allow the connection but just act as if there had been no
 encryption."  If this is done, however, man-in-the-middle attacks become
 trivial, unless the user is expected to notice the lack of encryption
 (unlikely).
 
 For instance, suppose a user navigates to PayPal and bookmarks it. 
 PayPal is configured so that if you try using HTTP (e.g., typing
 "paypal.com" in the URL bar), it will redirect to HTTPS.  Therefore the user will
 bookmark a URL such as https://www.paypal.com/.  Now suppose the user 
 later attempts to access this site from the bookmark with a MITM present 
 (e.g., a free wireless router placed in a public place by a malicious 
 person).
 
 The router can intercept the HTTPS request, make its own identical HTTPS 
 request, and return the results to the original HTTPS request, but 
 signed with its own key instead of the original.  If the user agent 
 behaves as described in the example, the only way for the user to notice 
 this is to notice that the URL bar looks different, or whatever visual 
 cue the browser uses.  If the user agent raises a prominent scary 
 warning or even makes it difficult for the user to continue, on the 
 other hand, there's no way for the attacker to prevent this, AFAIK.
 
 The section should prohibit user agents from displaying self-signed 
 pages without at least giving a warning.  Or, at a minimum, it should 
 strongly discourage it.  Currently it seems to indicate that this 
 behavior is acceptable.  As far as I know, existing browsers all present 
 scary warnings for self-signed pages (probably so scary as to be 
 misleading, in fact, but that's a separate issue).

I've required UAs to catch this case and added this example.


 In 2.7:
 
 User agents must at a minimum support the UTF-8 and Windows-1252
 encodings, but may support more.
 
 It is not unusual for Web browsers to support dozens if not upwards
 of a hundred distinct character encodings.
 
 Why aren't the most important ones listed as requirements?

They are. UTF-8 and 1252 are the most important ones.


 This seems to be contrary to the usual HTML 5 philosophy of mandating 
 (or at least precisely specifying) existing behavior that's required for 
 compatibility.

Which others are needed for compatibility?

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Spec comments, sections 1-2

2009-07-29 Thread Aryeh Gregor
On Wed, Jul 29, 2009 at 4:39 AM, Ian Hickson i...@hixie.ch wrote:
 There is value in not changing them unless they are actually broken --
 when I edit the spec, there's always a risk I'll break something.

Okay, not a big deal then.

 I've required UAs to catch this case and added this example.

Okay, great.

 Which others are needed for compatibility?

I don't know, but there are certainly some.  Otherwise, why would
browsers support so many?  For instance, baidu.com is #9 on Alexa and
serves gb2312 as far as I can tell.  So does qq.com, which is #14.
And sina.com.cn, #19.  vkontakte.ru is #30 and serves Windows-1251.
tudou.com (#60) uses gbk.  rakuten.co.jp (#68) serves EUC-JP.

This is just from a quick manual look at a few of the largest
non-English sites.  I'd think it would be fairly easy for someone
(e.g., Google) to come up with a rough summary of character encoding
usage on the web by percentage, and for vendors to say which encodings
they support, so a useful common list could be worked out.

If browsers differ in which encodings they accept, that harms
interoperability, so I'd think it would be ideal if HTML 5 would
specify the exact list of encodings that must be supported and
prohibited support for any others.  The union of encodings supported
by existing browsers would be a reasonable start, since supporting a
new encoding is presumably pretty cheap.  Unless this is viewed as
outside the scope of HTML 5 -- e.g., if browsers tend to rely on the
operating system for encoding support.
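For what it's worth, the compatibility cost of a wrong label is easy to demonstrate: the same bytes decode to entirely different text under different legacy encodings. A small illustration using Python's codec names (the byte string here is just an example):

```python
raw = b"\xd6\xd0\xce\xc4"  # U+4E2D U+6587 ("Chinese") encoded as GB2312

# Decoded with the right label, we get the intended text:
assert raw.decode("gb2312") == "\u4e2d\u6587"

# Decoded as Windows-1251, the same bytes come out as Cyrillic mojibake:
assert raw.decode("windows-1251") == "\u0426\u0420\u041e\u0414"

# And they aren't valid UTF-8 at all, so a UTF-8 default would just fail:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass
else:
    raise AssertionError("expected a decode error")
```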


[whatwg] Spec comments, sections 1-2

2009-07-15 Thread Aryeh Gregor
In 2.4.4.1:

If position is not past the end of input, return to the top of the
step labeled "loop" in the overall algorithm (that's the step within
which these substeps find themselves).

Why not just go to step 9?  In any event this is inconsistent with
2.4.4.2, which says

If position is not past the end of input, return to the top of step 9
in the overall algorithm (that's the step within which these substeps
find themselves).

Either both should say "the top of step 9" or both should say "the top
of the step labeled loop."  I don't see the value in the whole "in the
overall algorithm . . ." part, since in context there's no ambiguity
with just giving the number.

If sign is positive, return value, otherwise return 0-value.

I initially read "0-value" as a single word, like "p-value" or
whatever.  Perhaps it should have spaces to make it more immediately
obvious that it's subtraction (0 - value).
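For context, the step quoted above is the tail end of the spec's rules for parsing integers. As a rough sketch of where that final step sits (simplified and not spec-exact; real implementations restrict to ASCII digits and handle more error cases):

```python
def parse_signed_integer(s: str) -> int:
    """Rough sketch of HTML5-style signed integer parsing."""
    i = 0
    # Skip leading whitespace.
    while i < len(s) and s[i] in " \t\n\r\x0c":
        i += 1
    # Collect an optional sign.
    sign = 1
    if i < len(s) and s[i] in "+-":
        if s[i] == "-":
            sign = -1
        i += 1
    if i >= len(s) or not s[i].isdigit():
        raise ValueError("not a valid integer")
    # Collect digits, stopping at the first non-digit.
    value = 0
    while i < len(s) and s[i].isdigit():
        value = value * 10 + int(s[i])
        i += 1
    # The step under discussion: if sign is positive, return value,
    # otherwise return 0 - value.
    return value if sign == 1 else 0 - value

assert parse_signed_integer("  -42px") == -42
```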

In 2.6.2:

The specification says that user agents may serve HTTPS content as
though it were unencrypted.  For instance, an example states: "If a
user connects to a server with a self-signed certificate, the user
agent could allow the connection but just act as if there had been no
encryption."  If this is done, however, man-in-the-middle attacks
become trivial, unless the user is expected to notice the lack of
encryption (unlikely).

For instance, suppose a user navigates to PayPal and bookmarks it.
PayPal is configured so that if you try using HTTP (e.g., typing
"paypal.com" in the URL bar), it will redirect to HTTPS.  Therefore
the user will bookmark a URL such as https://www.paypal.com/.  Now
suppose the user later attempts to access this site from the bookmark
with a MITM present (e.g., a free wireless router placed in a public
place by a malicious person).

The router can intercept the HTTPS request, make its own identical
HTTPS request, and return the results to the original HTTPS request,
but signed with its own key instead of the original.  If the user
agent behaves as described in the example, the only way for the user
to notice this is to notice that the URL bar looks different, or
whatever visual cue the browser uses.  If the user agent raises a
prominent scary warning or even makes it difficult for the user to
continue, on the other hand, there's no way for the attacker to
prevent this, AFAIK.

The section should prohibit user agents from displaying self-signed
pages without at least giving a warning.  Or, at a minimum, it should
strongly discourage it.  Currently it seems to indicate that this
behavior is acceptable.  As far as I know, existing browsers all
present scary warnings for self-signed pages (probably so scary as to
be misleading, in fact, but that's a separate issue).
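As a concrete illustration of the two client policies in question, here is a sketch using Python's ssl module (no network involved; it only shows the verification settings a client would hand to its connections):

```python
import ssl

# A default client context verifies the server's certificate chain and
# hostname, so a self-signed (or MITM-substituted) certificate makes the
# handshake fail rather than silently succeeding:
strict = ssl.create_default_context()
assert strict.verify_mode == ssl.CERT_REQUIRED
assert strict.check_hostname is True

# The behavior the 2.6.2 example permits amounts to this: accept any
# certificate and proceed as if the connection were plain HTTP.
# With these settings a MITM attack becomes trivial.
permissive = ssl.create_default_context()
permissive.check_hostname = False  # must be disabled before CERT_NONE
permissive.verify_mode = ssl.CERT_NONE
```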

In 2.7:

User agents must at a minimum support the UTF-8 and Windows-1252
encodings, but may support more.

It is not unusual for Web browsers to support dozens if not upwards
of a hundred distinct character encodings.

Why aren't the most important ones listed as requirements?  This seems
to be contrary to the usual HTML 5 philosophy of mandating (or at
least precisely specifying) existing behavior that's required for
compatibility.
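Presumably the rationale for that minimal pair is that UTF-8 covers new content while Windows-1252 can decode nearly any legacy byte stream (all but a handful of undefined C1 slots map to a character). A small illustration (Python; the byte string is just an example):

```python
raw = b"caf\xe9"  # "café" as encoded by Windows-1252 (or Latin-1)

# Windows-1252 happily decodes the lone 0xE9 byte:
assert raw.decode("windows-1252") == "caf\u00e9"

# The same bytes are invalid UTF-8 (0xE9 opens a three-byte sequence
# with no continuation bytes), so a UTF-8-only browser would fail:
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass
else:
    raise AssertionError("expected a decode error")
```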