Re: [whatwg] Fwd: Entity parsing

2009-07-30 Thread Ian Hickson
On Sat, 18 Jul 2009, Øistein E. Andersen wrote:
> 
> Non-semicolon-terminated entities that were conforming in HTML4, like 
> &pi and &mdash when they are not followed by a letter or digit (roughly 
> speaking), are currently expanded in Safari and Firefox, and requiring 
> this to change would be a regression affecting existing pages.
> 
> > As far as I can tell HTML5 more or less matches what legacy pages 
> > need,
> 
> You keep repeating this, and also that much work has been done to get 
> entity parsing right and that you really do not want to change it.  It 
> seems to me that you have tried to follow IE's behaviour closely, which 
> is not completely unreasonable.  I have not seen evidence of any 
> analysis of legacy pages supporting this decision, though; on the 
> contrary, more or less anecdotal evidence sent to the mailing list(s) 
> seems to suggest that certain modifications might make the algorithm 
> work better for legacy pages. Replicating IE may well be good enough and 
> seems like a reasonably safe option, but HTML5 does not completely 
> follow IE in other areas, and I do not quite see why entity parsing 
> should be treated differently.

It's certainly the case that we can find individual pages that depend on 
particular behaviours to support any argument.

I do not want to change the current parsing spec unless we have _very_ 
good reasons to do so, because there are now multiple implementations
and tests, and any change can introduce bugs and incompatibilities.

If you have strong data showing that a particular change to the spec would 
be highly beneficial, then it's something I'd be happy to consider. But 
I'm not willing to make changes just to change the spec from being 
compatible with IE to being compatible with WebKit, or some such. I need 
data showing that the change is needed.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Re: [whatwg] Fwd: Entity parsing

2009-07-17 Thread Øistein E. Andersen

On 5 Jun 2009, at 00:49, Ian Hickson wrote:

> Could you give an example of what you mean? I'm having trouble following
> your description
>
> On Fri, 24 Apr 2009, Øistein E. Andersen wrote:
>
> > Let &IE4 (resp. &HTML4, &HTML5) be a non-semicolon-terminated named
> > character reference from the IE4 (resp. HTML4, HTML5) set,

&IE4 includes &eacute, &iuml
&HTML4 includes in addition &pi, &oelig and
&HTML5 includes in addition &SHcy, &rcaron.

> > and let . (full stop) represent any character other than semicolon, and ^
> > (circumflex) any character which is (roughly) not an ASCII letter or
> > digit (i.e., [^a-zA-Z0-9]).  Not completely unreasonable sets of
> > character references to expand (outside of attribute values) include:
> >
> > 1) &IE4^

  e.g., caf&eacute (café)

> > 2) &IE4.

  e.g., na&iumlve (naïve)

> > 3) &HTML4^

  e.g., 2&pi (2π)

> > 4) &IE4. &HTML4^

  e.g., na&iumlve (naïve), 2&pi (2π)

> > 5) &HTML4.

  e.g., hors d'&oeliguvre (hors d'œuvre)

> > 6) &IE4. &HTML5^

  e.g., na&iumlve (naïve), &SHcy(A/K) [Ш(A/K)]

> > 7) &HTML4. &HTML5^

  e.g., hors d'&oeliguvre (hors d'œuvre), &SHcy(A/K) [Ш(A/K)]

> > 8) &HTML5.

  e.g., Dvo&rcaronák (Dvořák)

> > [...]
> > Currently, Opera follows 1),

  i.e., expands caf&eacute, but not na&iumlve or 2&pi

> > IE 2),

  i.e., expands caf&eacute and na&iumlve, but not 2&pi

> > and Safari and Firefox 3).

  i.e., expands caf&eacute and 2&pi, but not na&iumlve

> > My main concern is that &HTML4^ is actually legitimate in HTML4 and
> > works in both Safari and Firefox today, and that HTML5 should not change
> > the rendering of valid HTML4 pages unless there is a good reason to do
> > so.
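
A minimal sketch of how the eight candidate policies listed above could be
modelled, for illustration only. The entity tables are tiny subsets holding
just the names mentioned in this thread, and the names expand(), allowed()
and POLICIES are hypothetical, not taken from any browser or from the HTML5
spec. Two assumptions are made: end of input counts as matching ^, and when
several names could match, the longest permitted one wins.

```python
import string

# Illustrative subsets only: the real IE4/HTML4/HTML5 sets are far larger.
IE4 = {"eacute": "é", "iuml": "ï"}
HTML4 = {**IE4, "pi": "π", "oelig": "œ"}
HTML5 = {**HTML4, "SHcy": "Ш", "rcaron": "ř"}
SETS = {"IE4": IE4, "HTML4": HTML4, "HTML5": HTML5}

# Policies 1) to 8) from the message: (set, mode) pairs, where "." means
# "followed by anything but a semicolon" and "^" means "followed by
# something that is not an ASCII letter or digit".
POLICIES = {
    1: [("IE4", "^")],
    2: [("IE4", ".")],
    3: [("HTML4", "^")],
    4: [("IE4", "."), ("HTML4", "^")],
    5: [("HTML4", ".")],
    6: [("IE4", "."), ("HTML5", "^")],
    7: [("HTML4", "."), ("HTML5", "^")],
    8: [("HTML5", ".")],
}

ALNUM = set(string.ascii_letters + string.digits)


def allowed(name, following, policy):
    """True if the policy expands `name` when followed by `following`."""
    for set_name, mode in policy:
        if name in SETS[set_name]:
            # Assumption: "" (end of input) is treated as matching "^".
            if mode == "." or following not in ALNUM:
                return True
    return False


def expand(text, policy_number):
    """Expand named character references in element content."""
    policy = POLICIES[policy_number]
    names = sorted(HTML5, key=len, reverse=True)  # try longest names first
    out = []
    i = 0
    while i < len(text):
        if text[i] != "&":
            out.append(text[i])
            i += 1
            continue
        for name in names:
            if not text.startswith(name, i + 1):
                continue
            end = i + 1 + len(name)
            following = text[end:end + 1]
            if following == ";":
                # Semicolon-terminated references are always expanded.
                out.append(HTML5[name])
                i = end + 1
                break
            if allowed(name, following, policy):
                out.append(HTML5[name])
                i = end
                break
        else:
            # No reference is expanded here: keep the ampersand literally.
            out.append("&")
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for sample in ["caf&eacute", "na&iumlve", "2&pi", "hors d'&oeliguvre"]:
        for number in (1, 2, 3):
            print(number, sample, "->", expand(sample, number))
```

Under these assumptions, policies 1) to 3) differ on exactly the three
examples given for Opera, IE, and Safari/Firefox. The attribute-value variant
mentioned in the original message could be obtained by replacing every "."
mode in POLICIES with "^".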


Non-semicolon-terminated entities that were conforming in HTML4, like  
&pi and &mdash when they are not followed by a letter or digit  
(roughly speaking), are currently expanded in Safari and Firefox, and  
requiring this to change would be a regression affecting existing pages.


> As far as I can tell HTML5 more or less matches what legacy pages
> need,


You keep repeating this, and also that much work has been done to get  
entity parsing right and that you really do not want to change it.  It  
seems to me that you have tried to follow IE's behaviour closely,  
which is not completely unreasonable.  I have not seen evidence of any  
analysis of legacy pages supporting this decision, though; on the  
contrary, more or less anecdotal evidence sent to the mailing list(s)  
seems to suggest that certain modifications might make the algorithm  
work better for legacy pages. Replicating IE may well be good enough  
and seems like a reasonably safe option, but HTML5 does not completely  
follow IE in other areas, and I do not quite see why entity parsing  
should be treated differently.


--
Øistein E. Andersen

Re: [whatwg] Fwd: Entity parsing

2009-06-04 Thread Ian Hickson
On Fri, 24 Apr 2009, Øistein E. Andersen wrote:
> 
> When a named character reference is followed by a semicolon, it clearly 
> has to be expanded, but how to handle non-semicolon-terminated character 
> references is less obvious.
> 
> Let &IE4 (resp. &HTML4, &HTML5) be a non-semicolon-terminated named 
> character reference from the IE4 (resp. HTML4, HTML5) set, and let . 
> (full stop) represent any character other than semicolon, and ^ 
> (circumflex) any character which is (roughly) not an ASCII letter or 
> digit (i.e., [^a-zA-Z0-9]).  Not completely unreasonable sets of 
> character references to expand (outside of attribute values) include:
> 
>   1) &IE4^
>   2) &IE4.
>   3) &HTML4^
>   4) &IE4. &HTML4^
>   5) &HTML4.
>   6) &IE4. &HTML5^
>   7) &HTML4. &HTML5^
>   8) &HTML5.
> 
> (The set of character references to be expanded in attribute values 
> could be obtained by replacing . by ^ above.)
> 
> Currently, Opera follows 1), IE 2), and Safari and Firefox 3).
> 
> My main concern is that &HTML4^ is actually legitimate in HTML4 and 
> works in both Safari and Firefox today, and that HTML5 should not change 
> the rendering of valid HTML4 pages unless there is a good reason to do 
> so.

Could you give an example of what you mean? I'm having trouble following 
your description above.

As far as I can tell HTML5 more or less matches what legacy pages need, 
but if there are specific entities that should be parsed in a different 
way than HTML5 says they should, I'm happy to fix this.

-- 
Ian Hickson   U+1047E)\._.,--,'``.fL
http://ln.hixie.ch/   U+263A/,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'