Re: [xml] HTMLparser: SGML comments

Daniel Veillard Wed, 09 Nov 2005 00:23:20 -0800

On Wed, Nov 09, 2005 at 03:10:11PM +1100, Michael Day wrote:
> 
> Hi,
> 
> HTMLparser currently parses comments by looking for a --> to end the
> comment. However, this does not handle SGML comments, in which -- is used
> to toggle whether > ends the comment. It is possible for an SGML comment
> to look like this:
> 
>     <!-- Hel>lo -- world --> good>bye -- world >
> 
> The whole thing is one comment, broken down like this:
> 
>     "<!--"          starts the comment
>     " Hel>lo "      comment text ('>' is treated as text)
>     "--"            toggles state ('>' will end the comment)
>     " world "       comment text
>     "--"            toggles state ('>' will be treated as text)
>     "> good>bye "   comment text ('>' is treated as text)
>     "--"            toggles state ('>' will end the comment)
>     " world "       comment text
>     ">"             ends the comment
> 
> This looks pretty scary, but this is how Mozilla handles HTML comments in
> standards mode and Opera is going to do the same. The Acid2 test from the
> Web Standards Project includes an SGML comment:
> 
>     http://www.webstandards.org/act/acid2/
> 
> For further info on SGML comments in HTML, see:
> 
>     http://www.howtocreate.co.uk/SGMLComments.html
> 
> I have a patch for HTMLparser.c to make it parse SGML comments. It also
> strips "--" from the text of the comment node, which is different from the
> existing behaviour:
> 
>     <!-- Hello -->
>     comment(" Hello ")                // identical to old behaviour
> 
>     <!-- Hello ---- world -->
>     comment(" Hello  world ") // old behaviour includes "----"
> 
>     <!-- Hello -- --> -- world >
>     comment(" Hello  >  world ")
> 
> Stripping out the "--" from the text of the comment node also makes it
> possible to take documents that were parsed by HTMLparser and serialise
> them as well-formed XML, which is sometimes not possible now.
> 
> Would this patch be acceptable?
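
For readers following the thread, the toggling rule described in the quoted message can be sketched in a few lines of C. This is only an illustration of the rule as stated above, not the patch under discussion: the function name `sgml_comment_text` and its signature are hypothetical and are not part of libxml2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not libxml2 API: copy the text of an SGML-style
 * comment starting at `in` (which must begin with "<!--") into `out`,
 * dropping every "--" delimiter. Returns the number of input bytes
 * consumed, or 0 if the comment is unterminated or `out` is too small.
 */
size_t sgml_comment_text(const char *in, char *out, size_t outsz)
{
    size_t i = 4;        /* position just past the opening "<!--" */
    size_t n = 0;        /* bytes written to out */
    int gt_is_text = 1;  /* right after "<!--", '>' is plain text */

    assert(in != NULL && out != NULL);
    if (strncmp(in, "<!--", 4) != 0)
        return 0;

    while (in[i] != '\0') {
        if (in[i] == '-' && in[i + 1] == '-') {
            gt_is_text = !gt_is_text;  /* "--" toggles the state */
            i += 2;
            continue;
        }
        if (in[i] == '>' && !gt_is_text) {
            if (n >= outsz)
                return 0;
            out[n] = '\0';
            return i + 1;              /* include the closing '>' */
        }
        if (n + 1 >= outsz)
            return 0;                  /* keep room for the NUL */
        out[n++] = in[i++];
    }
    return 0;                          /* ran off the end: unterminated */
}
```

Called on the three examples from the quoted message, this sketch yields " Hello ", " Hello  world " and " Hello  >  world " respectively, which matches the comment text listed there.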


  Sounds like a good idea to fix the parser behaviour to be more correct, yes.
I don't really know SGML, so such patches are welcome. I just have one
problem with the code: it calls GROW only when the end of the buffer is
detected with a NUL. I would rather have it called more preemptively
in the loop to avoid a potential weakness in the case of multibyte chars.
  Note also that I prefer patches to cut and paste of full routines, since
they give me the context of what was changed.
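
One possible reading of the GROW concern, as a toy sketch: if the buffer is refilled only once a NUL is seen, a multibyte character may be examined while its trailing bytes are still unread. Growing before each character, whenever fewer than a few bytes of lookahead remain, avoids that. Everything below (`struct toy_input`, `toy_grow`, `toy_count_chars`, `LOOKAHEAD`) is hypothetical and only stands in for the parser input; it does not reproduce the real GROW macro.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOOKAHEAD 4  /* assumed: longest UTF-8 sequence, in bytes */

/* Hypothetical stand-in for a parser input: `src` holds the whole
 * document, but only the first `avail` bytes have been read so far. */
struct toy_input {
    const char *src;
    size_t len;    /* total bytes in src */
    size_t avail;  /* bytes currently visible to the parser */
    size_t cur;    /* current parser position */
    size_t chunk;  /* bytes revealed per refill */
};

/* Reveal one more chunk; plays the role GROW plays in the mail. */
static void toy_grow(struct toy_input *in)
{
    in->avail += in->chunk;
    if (in->avail > in->len)
        in->avail = in->len;
}

/* Length of the UTF-8 sequence introduced by lead byte c. */
static size_t utf8_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;  /* invalid lead byte: step over one byte */
}

/* Count characters, growing before each one so that a multibyte
 * sequence is never examined while split across a chunk boundary.
 * Returns (size_t)-1 if the input itself ends mid-sequence. */
size_t toy_count_chars(struct toy_input *in)
{
    size_t count = 0;

    for (;;) {
        /* Preemptive refill: top up whenever lookahead runs short. */
        while (in->avail - in->cur < LOOKAHEAD && in->avail < in->len)
            toy_grow(in);
        if (in->cur >= in->avail)
            break;  /* all input consumed */

        size_t n = utf8_len((unsigned char)in->src[in->cur]);
        if (in->cur + n > in->avail)
            return (size_t)-1;  /* truncated even after growing */
        in->cur += n;
        count++;
    }
    return count;
}
```

The refill check sits at the top of the loop rather than in an end-of-buffer branch, so it runs for every character regardless of where a chunk boundary happens to fall.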

    thanks !

Daniel

-- 
Daniel Veillard      | Red Hat http://redhat.com/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
