On Wed, 2010-03-17 at 15:37 +0100, Jan Klopper wrote:
> Dear,
> 
> I am using hubbub in a html and xml parsing project, but have recently
> come upon a problem.

Hubbub is solely an HTML parser. It will convert any given stream of
octets into a tree as per HTML5. This means that it's highly likely that
XML input will not produce the DOM you expect.

> When a self closing tag without arguments is found, it's being treated
> as a non closing tag, and thus envelopes all tags that follow it.

Yes. This is correct for parsing HTML. The presence of attributes has no
effect on whether the self-closing flag is respected. Per HTML5, the
self-closing flag is only respected on a few elements (<br /> will
self-close, for example).

> The following XML:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
>   <a />
>   <b x="1"/>
>   <d />
>   <e />
> </root>

As I've said above, Hubbub is an HTML parser, so will parse this input
as if it were HTML.

> is parsed as:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
>  <a>   
>    <b x="1">
>     <c>
>       <d></d>
>     </c>
>    </b>
>  </a>
> </root>

Where has node C come from? It's not in the example input above.
Similarly, node E has disappeared.

I would expect the input you provided to be parsed into something that
looks like:

<!-- ?xml version="1.0" encoding="UTF-8"? -->
<html>
 <head>
 <body>
  <root>
   <a>
    <b x="1">
     <d>
      <e>

I.e. Hubbub will convert the XML PI into a comment, auto-generate the
missing html, head, and body nodes, and then insert the remaining
elements, ignoring any self-closing flags, as none of the element names
in question result in the self-closing flag being respected.
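
To illustrate the nesting rule, here is a rough toy model in Python. It is
not Hubbub's code and does not use Hubbub's API; it only uses the standard
library's html.parser as a tokeniser. The names NestingModel, outline and
VOID_ELEMENTS are made up for this sketch. It assumes that only the HTML
void elements close themselves, and it skips the XML PI handling and the
auto-generated html, head, and body nodes.

```python
from html.parser import HTMLParser

# HTML void elements: these never have content, so "<br />" ends at once.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class NestingModel(HTMLParser):
    """Toy model: records each element name indented by its depth."""

    def __init__(self):
        super().__init__()
        self.lines = []   # one entry per element, indented by depth
        self.stack = []   # currently open elements

    def handle_starttag(self, tag, attrs):
        self.lines.append(" " * len(self.stack) + tag)
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)  # stays open until an end tag arrives

    def handle_startendtag(self, tag, attrs):
        # "<a />" is treated exactly like "<a>": the trailing slash is ignored.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the named element and anything still open inside it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def outline(markup):
    parser = NestingModel()
    parser.feed(markup)
    parser.close()
    return parser.lines


print("\n".join(outline('<root><a /><b x="1"/><d /><e /></root>')))
```

With the element names from your example, each element ends up nested
inside the previous one, which matches the tree sketched above.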


J.

