[whatwg] Handling </br> in the after head insertion mode

2008-12-04 Thread Tommy Thorsen

Consider the following simple markup:

<!doctype html></br>

If I run it through my parser, which is implemented according to the 
html5 algorithm, the resulting DOM is as follows:


html
   head
   body

The br end tag is a bit special, and should be handled as if it was a br 
start tag. What happens here is as follows: The "before head" insertion 
mode will, upon receiving a br end tag, create a head node and switch to 
the "in head" insertion mode. "in head" will close the head node and 
move on to the "after head" insertion mode. I was expecting "after head" 
to see the </br> and do what it does on a start tag, which is to create 
a body node and move to the "in body" insertion mode, but the </br> is 
just ignored.


I've changed my implementation of "after head" to handle </br> just like 
the "in head" insertion mode, which is:


   An end tag whose tag name is "br"
   Act as described in the "anything else" entry below.

This results in the following DOM for the example above:

html
   head
   body
      br

This matches Internet Explorer and Opera, but not Firefox and Safari. 
Then again, it looks like Firefox and Safari ignore all </br> tags.
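
The mode-by-mode walk described above can be sketched as a toy state 
machine. This is only an illustration under simplifying assumptions: the 
function name build_tree, the string tokens and the (depth, name) tree 
encoding are all hypothetical, and only the </br> and end-of-file paths 
are modelled. The flag after_head_handles_br switches between the 
behaviour described for the current spec text and the proposed change.

```python
def build_tree(tokens, after_head_handles_br):
    """Return the DOM as (depth, name) pairs for '</br>' and 'EOF' tokens.

    Hypothetical toy model: each insertion mode only implements the
    "anything else" path that a </br> end tag (or EOF) would take.
    """
    tree, mode = [], "before html"
    for token in tokens:
        done = False
        while not done:                      # loop models "reprocess the token"
            if mode == "before html":
                tree.append((0, "html"))     # create the html element
                mode = "before head"
            elif mode == "before head":
                tree.append((1, "head"))     # create the head element
                mode = "in head"
            elif mode == "in head":
                mode = "after head"          # close head, then reprocess
            elif mode == "after head":
                if token == "</br>" and not after_head_handles_br:
                    done = True              # parse error, token is ignored
                else:
                    tree.append((1, "body")) # create the body element
                    mode = "in body"
            else:                            # "in body"
                if token == "</br>":
                    tree.append((2, "br"))   # </br> is treated like a br start tag
                done = True
    return tree


spec_tree = build_tree(["</br>", "EOF"], after_head_handles_br=False)
patched_tree = build_tree(["</br>", "EOF"], after_head_handles_br=True)
```

With the flag off, the body is only created when the end of the file is 
reached and no br is inserted; with the flag on, the br ends up as a 
child of body, as in the tree above.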



Regards,
Tommy


Re: [whatwg] Handling </br> in the after head insertion mode

2008-12-04 Thread Tommy Thorsen

Anne van Kesteren wrote:

On Thu, 04 Dec 2008 11:51:05 +0100, timeless [EMAIL PROTECTED] wrote:
On Thu, Dec 4, 2008 at 12:34 PM, Tommy Thorsen [EMAIL PROTECTED] wrote:
This matches Internet Explorer and Opera, but not Firefox and Safari.
Then again, it looks like Firefox and Safari ignore all </br> tags.


if we're both able to get away with ignoring all </br> tags, wouldn't
the ideal forward path be to make them always ignored?


Firefox and Safari do not ignore it in quirks mode. I'd rather keep the 
number of differences between quirks and standards mode to a minimum 
(by doing what Opera and Internet Explorer do) than increase it.




Aha, I hadn't even thought about quirks vs. non-quirks. I definitely 
agree that keeping the differences to a minimum is a good thing.


Also, it's not like the specified parsing algorithm ignores </br> 
completely. All the insertion modes from "before html" to "in body" have 
special handling for it, except "after head". This makes it look more 
like someone just forgot to put that </br> handling in "after head" than 
like a deliberate decision to ignore the </br>.


For the record, the following markup:

<!doctype html><body></br>

results in:

html
   head
   body
      br

with the current algorithm, because the "in body" insertion mode treats 
</br> as if it was a <br>.




Re: [whatwg] Scoping elements and nested paragraphs

2008-12-02 Thread Tommy Thorsen

Ian Hickson wrote:

On Wed, 12 Nov 2008, Tommy Thorsen wrote:
  

Consider the following markup:
   <p><object><p>X</p></p>

The html5 parsing algorithm produces the following tree:
   <html><head></head><body><p><object><p>X</p><p></p></object></p></body></html>

whereas Firefox and Opera both produce:
   <html><head></head><body><p><object><p>X</p></object></p></body></html>

and IE produces:
   <html><head></head><body><p><object></object></p></body></html>

The main problem with the html5 output, in my opinion, is the extra <p></p>
inside the object. This happens because object is a scoping element and
the final </p> is not able to find the first <p>.

I've fixed this in our implementation by implementing the first paragraph in
'An end tag whose name is p' in "in body" as if it said:

---
If the stack of open elements does not have an element in scope with the same
tag name as that of the token, then this is a parse error.

If the stack of open elements does not contain an element with the same tag
name as that of the token, then act as if a start tag with the tag name p had
been seen, then reprocess the current token.
---



I don't really see this as a critical issue; did this break any pages? 
Since WebKit does what HTML5 does here, I've left the spec as is.


  


This does not, as far as I know, break any real pages. We did discuss 
the issue in the IRC channel after I sent this mail 
(http://krijnhoetmer.nl/irc-logs/whatwg/20081112#l-285 and onwards) and 
we came to the same conclusion as you. I've reverted the change I made 
to our parser so that we follow the specification.


regards,
Tommy


[whatwg] Scoping elements and nested paragraphs

2008-11-12 Thread Tommy Thorsen

Consider the following markup:
   <p><object><p>X</p></p>

The html5 parsing algorithm produces the following tree:
   <html><head></head><body><p><object><p>X</p><p></p></object></p></body></html>

whereas Firefox and Opera both produce:
   <html><head></head><body><p><object><p>X</p></object></p></body></html>

and IE produces:
   <html><head></head><body><p><object></object></p></body></html>

The main problem with the html5 output, in my opinion, is the extra 
<p></p> inside the object. This happens because object is a scoping 
element and the final </p> is not able to find the first <p>.


I've fixed this in our implementation by implementing the first 
paragraph in 'An end tag whose name is p' in "in body" as if it said:


---
If the stack of open elements does not have an element in scope with 
the same tag name as that of the token, then this is a parse error.


If the stack of open elements does not contain an element with the same 
tag name as that of the token, then act as if a start tag with the tag 
name p had been seen, then reprocess the current token.

---
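
The difference between the two wordings can be illustrated with a small 
sketch. It assumes the stack of open elements is modelled as a list of 
tag names (bottom first); the names SCOPING, has_in_scope and 
inserts_implied_p are hypothetical, and SCOPING is only an illustrative 
subset of the scoping elements, not the spec's full list.

```python
# Illustrative subset of scoping elements (assumption, not the full spec list).
SCOPING = {"object", "table", "html"}


def has_in_scope(stack, name):
    """Walk the stack from the current node down; stop at a scoping element."""
    for node in reversed(stack):
        if node == name:
            return True
        if node in SCOPING:
            return False
    return False


def inserts_implied_p(stack, modified):
    """Return True if a </p> token makes the parser insert an implied <p>."""
    if has_in_scope(stack, "p"):
        return False                  # a p is in scope: it is simply closed
    if modified:
        return "p" not in stack       # proposed wording: look at the whole stack
    return True                       # spec wording: out of scope means implied <p>


# Stack just before the final </p> in <p><object><p>X</p></p>
stack = ["html", "body", "p", "object"]
spec_result = inserts_implied_p(stack, modified=False)
modified_result = inserts_implied_p(stack, modified=True)
```

For this stack the spec wording inserts the implied p (the extra empty 
paragraph inside the object), while the modified wording does not, 
because the outer p is still on the stack even though object hides it 
from the scope check.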


Best regards,
Tommy


Re: [whatwg] Handling title inside body

2008-11-11 Thread Tommy Thorsen

Ian Hickson wrote:

On Mon, 10 Nov 2008, Tommy Thorsen wrote:
  
FWIW: In our implementation, I've changed the handling of base and 
title in "in body" to:


   Process the token using the rules for the "after head" insertion 
   mode.


instead of processing them with the rules for "in head".



Unless there are pages that depend on this (are there?), I'm very 
reluctant to change the spec in this way.
  


No, I haven't seen any pages that depend on this. I'll revert this 
change in our implementation one day soon, and try to match Opera's 
behaviour instead.




Re: [whatwg] script tag handling in after head

2008-11-11 Thread Tommy Thorsen

Henri Sivonen wrote:

On Nov 11, 2008, at 13:27, Tommy Thorsen wrote:

The assertion that the current node is still the head element pointer 
does not seem correct, as we push a script element onto the stack of 
open elements in the "A start tag whose tag name is 'script'" section 
of the "in head" insertion mode.



This is http://www.w3.org/Bugs/Public/show_bug.cgi?id=6038



Ah, thank you. I hadn't really looked at that bugzilla before. I added 
my solution as a third possible fix.


I think the first solution in the description of that bug sounds nice, 
but I'm not too fond of the second solution. (The one which involves 
sometimes popping one extra time.) Just my two cents' worth.




Re: [whatwg] Handling title inside body

2008-11-11 Thread Tommy Thorsen

Ian Hickson wrote:

On Mon, 10 Nov 2008, Tommy Thorsen wrote:
From an implementor's point of view, it's good to have clearly defined 
boundaries between modules. An implementation would typically have one 
module that tokenises and parses html and one module that renders the 
resulting dom to the screen. If all the unexpected input is dealt with 
in the parsing module, then you can make some assumptions in the 
rendering module which can greatly simplify the implementation. Having 
to deal with an arbitrary amount of illegal input in either module is, 
IMHO, not the ideal design.



Unfortunately, we have little choice in the matter. Scripting and XML both 
allow you to unambiguously create highly non-conforming DOMs, e.g. with 
title elements as the root element and html elements as children of 
input elements. The renderer has to deal with all such DOMs.


  


I just came across another related problem. Consider the following markup:

<!doctype html><select><title>TITLE</title></select>

My version of Firefox moves the title to head, Opera ignores the title 
completely, and the html 5 parsing algorithm produces the following 
peculiar markup:


<!DOCTYPE html>
<html>
   <head></head>
   <body>
      <select>TITLE</select>
   </body>
</html>

Should this title be allowed or ignored? Right now we ignore the start 
and end tags, but insert the CDATA into the select element. I'm tempted 
to ignore CDATA unless the current node is an option element in the 
"in select" insertion mode.



Since we were discussing scripts creating unexpected DOMs, I had to try 
the following:


<!doctype html>
<script>
   function button_onclick() {
      document.getElementById('myselect').innerHTML = '<title>TITLE</title>';
      alert('title inserted');
   }
</script>
<select id="myselect"></select>
<input type="button" value="Make Title" onclick="button_onclick();" />

On Firefox, the title is inserted into the select element, but does not 
actually work. Opera seems to prevent the title element from being 
inserted into the select element altogether.


-Tommy


[whatwg] script tag handling in after head

2008-11-11 Thread Tommy Thorsen
This one is kinda complex, but I'll try to explain the problem. The 
algorithm for handling script start tags in the "after head" insertion 
mode requires us to push the head element pointer onto the stack of open 
elements, then process the token using the rules for the "in head" 
insertion mode. Finally we are required to:


   Pop the current node (which will be the node pointed to by the head 
   element pointer)


The assertion that the current node is still the head element pointer 
does not seem correct, as we push a script element onto the stack of 
open elements in the "A start tag whose tag name is 'script'" section of 
the "in head" insertion mode.


Alternatively, the script tag handling in "in head" could be interpreted 
to not require us to push the script element onto the stack of open 
elements, but then the following assertion in "An end tag whose tag name 
is 'script'" in "in CDATA/RCDATA" will not hold true:


   Let script be the current node (which will be a script element).

In our implementation, I've chosen to implement the "A start tag token 
whose tag name is one of: 'base', 'link', 'meta', 'noframes', 'script', 
'style', 'title'" as if it said:


   Parse error.
   Let /node/ be the head element pointer.
   Push the node pointed to by the head element pointer onto the stack 
   of open elements.
   Process the token using the rules for the "in head" insertion mode.
   Remove /node/ from the stack of open elements.

I haven't come across any problems with this approach so far...
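
To make the difference between "pop the current node" and "remove 
/node/" concrete, here is a minimal sketch. It assumes the stack of open 
elements is a plain list of names with the current node last; the 
function name script_in_after_head and its flag are hypothetical, and 
the "in head" handling is reduced to the single push that matters here.

```python
def script_in_after_head(stack, remove_by_identity):
    """Model the stack of open elements while a script start tag is handled."""
    head = "head"              # stands in for the head element pointer
    stack = list(stack)        # work on a copy of the caller's stack
    stack.append(head)         # push the head element onto the stack
    stack.append("script")     # "in head" rules insert the script element
    if remove_by_identity:
        stack.remove(head)     # proposed wording: remove /node/, wherever it sits
    else:
        stack.pop()            # spec wording: pop the current node
    return stack


after_pop = script_in_after_head(["html"], remove_by_identity=False)
after_remove = script_in_after_head(["html"], remove_by_identity=True)
```

Popping discards the script element and leaves head on the stack, which 
is the opposite of what the parenthetical in the spec text expects. 
Removing the remembered node leaves the script element as the current 
node, so the later assertion in the script end tag handling still holds.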


regards,
Tommy


Re: [whatwg] Handling title inside body

2008-11-10 Thread Tommy Thorsen

Simon Pieters wrote:
The description of the title element in the spec (4.2.2 The title 
element) says:


Contexts in which this element may be used:
In a head element containing no other title elements.

I don't care very strongly about whether or not title elements are 
allowed anywhere, but I do think the output of the parsing algorithm 
should be valid html according to the rest of the spec.


Why?


Hmm. Good question. If not, then why do we do foster parenting at all?

From an implementor's point of view, it's good to have clearly defined 
boundaries between modules. An implementation would typically have one 
module that tokenises and parses html and one module that renders the 
resulting dom to the screen. If all the unexpected input is dealt with 
in the parsing module, then you can make some assumptions in the 
rendering module which can greatly simplify the implementation. Having 
to deal with an arbitrary amount of illegal input in either module is, 
IMHO, not the ideal design.


That being said, if Opera has the specified behaviour and Firefox is 
switching to it, then I'm nowhere near stubborn enough to keep arguing. 
As long as I'm reassured this is a conscious decision and not an 
oversight on whatwg's part, I'll just go ahead and implement it 
according to the spec.


[whatwg] Handling title inside body

2008-11-10 Thread Tommy Thorsen
I noticed that, according to the html5 algorithm, when the parser sees a 
title start tag when in the "in body" insertion mode, it's not 
supposed to relocate it to the head element. Opera matches this 
behaviour, but Firefox moves any title tag it finds into the head element.


The description of the title element in the spec (4.2.2 The title 
element) says:


   Contexts in which this element may be used:
   In a head element containing no other title elements.

I don't care very strongly about whether or not title elements are 
allowed anywhere, but I do think the output of the parsing algorithm 
should be valid html according to the rest of the spec. So, in my 
opinion, we need to change either the allowed context of the title 
element, or the parsing algorithm.


I think everything I've said about the title element also applies to 
the base element.



FWIW: In our implementation, I've changed the handling of base and 
title in "in body" to:


   Process the token using the rules for the "after head" insertion mode.

instead of processing them with the rules for "in head".


Best regards,
Tommy


[whatwg] li start tag algorithm clarification.

2008-11-10 Thread Tommy Thorsen
In the handler for 'A start tag whose tag name is li' in "in body", 
the algorithm says "jump to the last step" in a couple of places. Is 
the last step step 5, or is it the final unnumbered step which says, 
"Finally, insert an HTML element for the token"?


I suggest, to make this clearer, that we give the final step a number 
(the number 6 comes to mind) and change "jump to the last step" to "jump 
to step 5/6".


This also goes for the next section (the one called 'A start tag whose 
name is one of: dd, dt').



regards,
Tommy


[whatwg] parsing nested forms

2008-11-06 Thread Tommy Thorsen

Hi again!

Before I get to the real issue, I think I should give you a little bit 
of background. I'm working for a company which makes a web browser. 
We've been having some problems with our algorithm for parsing illegal 
html, so we decided to scrap the whole module and implement the 
algorithm exactly as outlined in the html5 spec. So far this has been a 
great success. We're already way better than we used to be, but there 
are some situations where the html5 parsing algorithm does not quite 
give us the result we expected.


Yesterday I noticed that we were not displaying the site 
http://bankrate.com correctly. The problem we had on that page boils 
down to the following markup:


<div id="firstdiv">
   A
   <div id="seconddiv">
      <form id="firstform">
         <div id="thirddiv">
            <form id="secondform"></form>
         </div>
      </form>
   </div>
   B
</div>

I'll walk you through it. Everything is normal until we reach the start 
tag for the secondform. It is ignored, since we're already in a form 
(the form element pointer points to firstform). Then we see the end 
tag which was meant for secondform. We pop elements from the stack of 
open elements until we find a form element (which is firstform), 
popping off thirddiv in the process. The next token we get is the end 
div tag which was meant for thirddiv. Since thirddiv is already 
gone, we pop seconddiv instead, and now we're sort of off-balance. The 
result is that A and B do not end up as children of the same div.


If any of you would like to see the effect this can have on a real page, 
you can use the parse.py script in html5lib. On a command line, use the 
following commands:


[EMAIL PROTECTED] html]$ wget -k -O bankrate.html http://bankrate.com
[EMAIL PROTECTED] html]$ /path/to/html5lib/python/parse.py bankrate.html > bankrate_parsed.html
[EMAIL PROTECTED] html]$ firefox bankrate_parsed.html

I've applied a fix to our code which makes us handle this particular 
case better. I haven't tested it very thoroughly, but the change is to 
implement the 'An end tag whose tag name is form' section in "in body" 
as if it said:


--
An end tag whose tag name is form

   Let /node/ be the form element pointer.
   Set the form element pointer to null.

   If the stack of open elements does not have an element in scope with 
   the same tag name as that of the token, then this is a parse error; 
   ignore the token.

   Otherwise, run these steps:

      1. Generate implied end tags.
      2. If the current node is not an element with the same tag name 
         as that of the token, then this is a parse error.
      3. Remove /node/ from the stack of open elements.
--

This seems to give us pretty much the same behaviour as Opera for the 
simple example above. Can any of you see any potential problems with 
this approach? In any case, I do believe that the specification needs to 
be changed one way or another, so that it handles this case better.
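
For anyone who wants to play with the example without a full parser, the 
following is a rough sketch of both behaviours. It is a toy model, not 
our implementation: the function name text_parents, the token tuples and 
the (tag, id) stack entries are hypothetical, implied end tags are left 
out, and "in scope" is simplified to "on the stack", which is enough 
here because the example contains no scoping elements.

```python
def text_parents(tokens, patched):
    """Map each text token to the id of the element it is inserted into."""
    stack = [("body", "body")]        # stack of open elements as (tag, id)
    form_pointer = None               # the form element pointer
    parents = {}
    for kind, *rest in tokens:
        if kind == "text":
            parents[rest[0]] = stack[-1][1]
        elif kind == "start":
            tag, ident = rest
            if tag == "form":
                if form_pointer is not None:
                    continue          # nested form start tag is ignored
                form_pointer = (tag, ident)
            stack.append((tag, ident))
        else:                         # end tag
            tag = rest[0]
            node = form_pointer
            on_stack = any(t == tag for t, _ in stack)
            if tag == "form":
                form_pointer = None
            if not on_stack:
                continue              # parse error, token is ignored
            if tag == "form" and patched:
                if node in stack:
                    stack.remove(node)    # proposed: remove /node/ only
            else:
                while stack.pop()[0] != tag:
                    pass              # spec: pop until the tag has been popped
    return parents


TOKENS = [
    ("start", "div", "firstdiv"), ("text", "A"),
    ("start", "div", "seconddiv"), ("start", "form", "firstform"),
    ("start", "div", "thirddiv"), ("start", "form", "secondform"),
    ("end", "form"), ("end", "div"), ("end", "form"), ("end", "div"),
    ("text", "B"), ("end", "div"),
]
spec_parents = text_parents(TOKENS, patched=False)
patched_parents = text_parents(TOKENS, patched=True)
```

In this model the unpatched run puts A in firstdiv and B directly in 
body, while the patched run keeps both A and B in firstdiv.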


I think I have a couple of other instances where we've had to deviate 
from the specification in order to tackle problems discovered by our 
testers, and if any of you are interested in this kind of feedback, I'll 
dig them out and post them on this list.


Best regards
Tommy Thorsen


[whatwg] end body tag parsing clarification

2008-11-05 Thread Tommy Thorsen
I've been looking at the parsing chapter of the HTML5 specification, and 
I've found something which I don't think makes sense. The last two 
sentences in the 'An end tag whose tag name is body' section in the 
"in body" insertion mode say:


  Switch the insertion mode to "after body". Otherwise, ignore the token.

The "Otherwise" does not really make sense in this context, does it? 
Should that last sentence just be erased?


Regards,
Tommy Thorsen