Re: [whatwg] Discrepancies between HTML and ES rules for parsing an integer or float

2012-01-23 Thread Ian Hickson
On Wed, 3 Aug 2011, Aryeh Gregor wrote:

 Hixie just WONTFIXed two bugs that I thought might be of interest:
 
 http://www.w3.org/Bugs/Public/show_bug.cgi?id=12220
 http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296
 
 Basically, HTML defines some algorithms for parsing integers, floats, 
 etc., which are used in converting DOM to IDL attributes for reflection 
 (among other things):
 
 http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#numbers
 
 The algorithms for parsing integers and floats are almost exactly the 
 same as ECMAScript's parseInt() and parseFloat(), down to some of the 
 language being copied word-for-word, but with subtle differences 
 involving (at least) whitespace handling.  IMO, this is bad for several 
 reasons:
 
 * It's confusing to both authors and implementers to have multiple 
 almost identical algorithms.  Nobody's going to expect the discrepancy 
 in the corner cases where it matters.

 * It's confusing to people reading the spec for there to be these extra 
 algorithms defined, whose relationship to the ES algorithms is not 
 obvious.  The HTML and ES algorithms are written in entirely different 
 styles and it's hard to tell what the differences are from side-by-side 
 inspections.

 * In at least some cases, all browsers match ES and none match the spec 
 -- see http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296#c4.

 * Browsers will have to maintain the ES algorithms as well as the HTML 
 algorithms, so even if the HTML algorithms are superior, it doesn't save 
 anyone the effort of understanding or implementing the ES algorithms.
 
 So I think HTML should just defer to ES here.

The reasons for not doing so are:

 - The exact ES algorithm would need preprocessing anyway, to exclude 
   values like Infinity or NaN.

 - Having the algorithm depend on Unicode would mean HTML processing would 
   change over time without good reason. There's no need to support 
   non-ASCII characters in numeric attributes. (HTML generally is designed 
   to only use ASCII characters.)

 - It's simpler to implement from scratch if the HTML spec just defines 
   the algorithm than having to defer to another spec. This is especially 
   the case because the JS algorithms support features we don't need, e.g. 
   parseInt() supports a radix argument, and because the rules for parsing 
   floats in HTML are significantly more straightforward than in ES.

 - The JS algorithms allow approximations that are unnecessary to support 
   in the HTML spec.

 - If you're writing an HTML tool, it's simpler to just use an HTML 
   library that defines the HTML algorithms than use both an HTML library 
   _and_ a JS library.

 - If you're writing a library, it's simpler to not have to include a JS 
   library just for a few parsing primitives.

 - If you're not going to use another library, then there's nothing gained 
   from referencing another spec.

 - It's simpler to spec and to understand if we're not deferring to other 
   specs for simple things like microsyntax parsers.
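
As a rough illustration of the first point in the list above, here is a minimal Python sketch of the extra filtering a general-purpose float parser needs before its result is usable for a numeric attribute. Python's float() is only a stand-in for a parser like parseFloat() (it is stricter about trailing characters, for one), and the name parse_finite_float is hypothetical, not something from either spec.

```python
import math


def parse_finite_float(text):
    """Return a finite float, or None if the value should be rejected."""
    try:
        value = float(text)  # stand-in for a general-purpose parser
    except ValueError:
        return None
    # The extra step: a general parser accepts "Infinity" and "NaN",
    # which a numeric attribute has no use for, so filter them out.
    return value if math.isfinite(value) else None


print(parse_finite_float("1.5"))
print(parse_finite_float("Infinity"))
print(parse_finite_float("NaN"))
```

The wrapper exists whichever parser sits underneath it, which is the sense in which deferring to the other algorithm does not remove the need for HTML-specific handling.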


On Thu, 4 Aug 2011, Jonas Sicking wrote:
 
 It would make sense to me to match ES here.

With the exception of the definition of leading white space and how 
approximations are handled in the face of hardware limitations, we do 
match ES. An implementation that wanted to share common code here would be 
able to already.


On Fri, 5 Aug 2011, Jonas Sicking wrote:
 
 Sounds good. I'm for such a change yes.

There are two possible changes here: making the HTML spec's definition of 
parsing numbers use Unicode's varying definition of whitespace rather than 
a small fixed set (which would make HTML parsing depend on non-ASCII 
values), or just referencing the JS spec directly. For the reasons 
described above, I have not done either at this time.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'


Re: [whatwg] Discrepancies between HTML and ES rules for parsing an integer or float

2011-08-05 Thread Aryeh Gregor
On Fri, Aug 5, 2011 at 1:57 AM, Jonas Sicking jo...@sicking.cc wrote:
 It would make sense to me to match ES here. The main concern is of
 course website compat. Could someone detail what the differences would
 be compared to what implementations/the HTML5 spec do now?

As far as I know, the only difference between the HTML and ES
algorithms is handling of non-ASCII whitespace: ES treats it as
whitespace, HTML does not.  Specifically, ES treats StrWhiteSpaceChar
as leading whitespace:

http://es5.github.com/#x15.1.2.2

That includes any Unicode space separator (Zs), which in particular
changes over time (which seems to be Hixie's main objection IIUC).
HTML uses skip whitespace:

http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#signed-integers

Which if you follow the breadcrumbs means only [ \t\n\r\f].  So it's
almost never going to make any difference in practice, we're talking
only about corner cases.
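
To make the difference concrete, here is a small Python sketch of the two whitespace definitions described above. The HTML set is the five characters just listed; the ES set is the explicitly listed ES5 WhiteSpace and LineTerminator characters plus anything in Unicode category Zs. The function names are hypothetical, and the Zs lookup uses whatever Unicode database ships with the Python in use, so its answers can vary by version (which is the objection about the definition changing over time).

```python
import unicodedata

# HTML "skip whitespace": space, tab, LF, FF, CR.
HTML_SPACE = set(" \t\n\f\r")

# Characters ES5 lists explicitly as WhiteSpace or LineTerminator.
ES_EXPLICIT = set("\t\x0b\x0c \xa0\ufeff\n\r\u2028\u2029")


def is_html_space(ch):
    return ch in HTML_SPACE


def is_es_space(ch):
    # Zs membership depends on the bundled Unicode version.
    return ch in ES_EXPLICIT or unicodedata.category(ch) == "Zs"


def differing_chars(candidates):
    """Return the characters the two definitions classify differently."""
    return [c for c in candidates if is_html_space(c) != is_es_space(c)]


sample = " \t\n\r\f\x0b\xa0\u2003\u2028"
print([hex(ord(c)) for c in differing_chars(sample)])
# prints ['0xb', '0xa0', '0x2003', '0x2028']
```

The five ASCII characters agree under both definitions; every character in the sample that differs is one ES skips and HTML does not.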

I have a simple test-case at
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296#c4 that shows
all browsers strip leading \x0b (vertical tab) when converting DOM
attributes to ints, which matches ES and not HTML.
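
For comparison, here is a minimal Python sketch of what the HTML integer-parsing rules would do with that input. It follows the general shape of the spec algorithm (skip whitespace, optional sign, collect ASCII digits, ignore the rest), but the name parse_html_integer, the use of None for the error case, and accepting a leading "+" are assumptions made for illustration.

```python
HTML_SPACE = " \t\n\f\r"
ASCII_DIGITS = "0123456789"


def parse_html_integer(text):
    """Return an int, or None where the HTML rules would return an error."""
    pos = 0
    # Skip whitespace: only the five ASCII characters above count.
    while pos < len(text) and text[pos] in HTML_SPACE:
        pos += 1
    sign = 1
    if pos < len(text) and text[pos] in "+-":
        if text[pos] == "-":
            sign = -1
        pos += 1
    start = pos
    # Collect ASCII digits; anything after them is ignored.
    while pos < len(text) and text[pos] in ASCII_DIGITS:
        pos += 1
    if pos == start:
        return None  # no digits where one was required
    return sign * int(text[start:pos])


print(parse_html_integer("  42"))    # prints 42
print(parse_html_integer("\x0b42"))  # prints None
```

Under these rules a leading vertical tab is not whitespace, so parsing stops before any digit is seen, whereas the test-case shows browsers returning the number.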

 For parsing floats this would not seem like a problem though since
 attributes containing floats are relatively new IIRC.

Yes, that's correct.  There's definitely no compat issue here with
floats, but really there's not going to be any with ints either, since
it's going to be exceedingly rare that anyone will put Unicode
whitespace in DOM attributes that are reflected as integers and then
rely on them working.  So it's just a question of if we'd prefer the
algorithms to match or not.


Re: [whatwg] Discrepancies between HTML and ES rules for parsing an integer or float

2011-08-05 Thread Jonas Sicking
On Fri, Aug 5, 2011 at 8:43 AM, Aryeh Gregor a...@aryeh.name wrote:
 On Fri, Aug 5, 2011 at 1:57 AM, Jonas Sicking jo...@sicking.cc wrote:
 It would make sense to me to match ES here. The main concern is of
 course website compat. Could someone detail what the differences would
 be compared to what implementations/the HTML5 spec do now?

 As far as I know, the only difference between the HTML and ES
 algorithms is handling of non-ASCII whitespace: ES treats it as
 whitespace, HTML does not.  Specifically, ES treats StrWhiteSpaceChar
 as leading whitespace:

 http://es5.github.com/#x15.1.2.2

 That includes any Unicode space separator (Zs), which in particular
 changes over time (which seems to be Hixie's main objection IIUC).
 HTML uses skip whitespace:

 http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#signed-integers

 Which if you follow the breadcrumbs means only [ \t\n\r\f].  So it's
 almost never going to make any difference in practice, we're talking
 only about corner cases.

 I have a simple test-case at
 http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296#c4 that shows
 all browsers strip leading \x0b (vertical tab) when converting DOM
 attributes to ints, which matches ES and not HTML.

 For parsing floats this would not seem like a problem though since
 attributes containing floats are relatively new IIRC.

 Yes, that's correct.  There's definitely no compat issue here with
 floats, but really there's not going to be any with ints either, since
 it's going to be exceedingly rare that anyone will put Unicode
 whitespace in DOM attributes that are reflected as integers and then
 rely on them working.  So it's just a question of if we'd prefer the
 algorithms to match or not.

Sounds good. I'm for such a change yes.

/ Jonas


Re: [whatwg] Discrepancies between HTML and ES rules for parsing an integer or float

2011-08-04 Thread Jonas Sicking
On Wed, Aug 3, 2011 at 11:21 AM, Aryeh Gregor a...@aryeh.name wrote:
 Hixie just WONTFIXed two bugs that I thought might be of interest:

 http://www.w3.org/Bugs/Public/show_bug.cgi?id=12220
 http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296

 Basically, HTML defines some algorithms for parsing integers, floats,
 etc., which are used in converting DOM to IDL attributes for
 reflection (among other things):

 http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#numbers

 The algorithms for parsing integers and floats are almost exactly the
 same as ECMAScript's parseInt() and parseFloat(), down to some of the
 language being copied word-for-word, but with subtle differences
 involving (at least) whitespace handling.  IMO, this is bad for
 several reasons:

 * It's confusing to both authors and implementers to have multiple
 almost identical algorithms.  Nobody's going to expect the discrepancy
 in the corner cases where it matters.
 * It's confusing to people reading the spec for there to be these
 extra algorithms defined, whose relationship to the ES algorithms is
 not obvious.  The HTML and ES algorithms are written in entirely
 different styles and it's hard to tell what the differences are from
 side-by-side inspections.
 * In at least some cases, all browsers match ES and none match the
 spec -- see http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296#c4.
 * Browsers will have to maintain the ES algorithms as well as the HTML
 algorithms, so even if the HTML algorithms are superior, it doesn't
 save anyone the effort of understanding or implementing the ES
 algorithms.

 So I think HTML should just defer to ES here.  Hixie disagrees, and
 has resolved both bugs twice now, so I'm not going to reopen them
 myself at this point.  However, I'd like to hear from implementers
 whether they're willing to implement the spec as it stands, or whether
 they want the spec algorithms to be identical to ES's algorithms.

It would make sense to me to match ES here. The main concern is of
course website compat. Could someone detail what the differences would
be compared to what implementations/the HTML5 spec do now?

For parsing floats this would not seem like a problem though since
attributes containing floats are relatively new IIRC.

/ Jonas


[whatwg] Discrepancies between HTML and ES rules for parsing an integer or float

2011-08-03 Thread Aryeh Gregor
Hixie just WONTFIXed two bugs that I thought might be of interest:

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12220
http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296

Basically, HTML defines some algorithms for parsing integers, floats,
etc., which are used in converting DOM to IDL attributes for
reflection (among other things):

http://www.whatwg.org/specs/web-apps/current-work/multipage/common-microsyntaxes.html#numbers

The algorithms for parsing integers and floats are almost exactly the
same as ECMAScript's parseInt() and parseFloat(), down to some of the
language being copied word-for-word, but with subtle differences
involving (at least) whitespace handling.  IMO, this is bad for
several reasons:

* It's confusing to both authors and implementers to have multiple
almost identical algorithms.  Nobody's going to expect the discrepancy
in the corner cases where it matters.
* It's confusing to people reading the spec for there to be these
extra algorithms defined, whose relationship to the ES algorithms is
not obvious.  The HTML and ES algorithms are written in entirely
different styles and it's hard to tell what the differences are from
side-by-side inspections.
* In at least some cases, all browsers match ES and none match the
spec -- see http://www.w3.org/Bugs/Public/show_bug.cgi?id=12296#c4.
* Browsers will have to maintain the ES algorithms as well as the HTML
algorithms, so even if the HTML algorithms are superior, it doesn't
save anyone the effort of understanding or implementing the ES
algorithms.

So I think HTML should just defer to ES here.  Hixie disagrees, and
has resolved both bugs twice now, so I'm not going to reopen them
myself at this point.  However, I'd like to hear from implementers
whether they're willing to implement the spec as it stands, or whether
they want the spec algorithms to be identical to ES's algorithms.