On Wed, 2010-03-17 at 15:37 +0100, Jan Klopper wrote:
> Dear,
>
> I am using hubbub in a html and xml parsing project, but have recently
> come upon a problem.
Hubbub is solely an HTML parser. It will convert any given stream of
octets into a tree as per HTML5. This means that it's highly likely that
XML input will not produce the DOM you expect.
> When a self closing tag without arguments is found, it's being treated
> as a non closing tag, and thus envelopes all tags that follow it.
Yes. This is correct for parsing HTML. The presence of attributes should
have no effect on whether the self-closing flag is respected or not. Per
HTML5, the self-closing flag is only respected on a few elements
(<br /> will self-close, for example).
> The following XML:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
> <a />
> <b x="1"/>
> <d />
> <e />
> </root>
As I've said above, Hubbub is an HTML parser, so it will parse this input
as if it were HTML.
> is parsed as:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
> <a>
> <b x="1">
> <c>
> <d></d>
> </c>
> </b>
> </a>
> </root>
Where has node C come from? It's not in the example input above.
Similarly, node E has disappeared.
I would expect the input you provided to be parsed into something that
looks like:
<!-- ?xml version="1.0" encoding="UTF-8"? -->
<html>
  <head>
  <body>
    <root>
      <a>
        <b x="1">
          <d>
            <e>
I.e., Hubbub will convert the XML processing instruction into a comment,
auto-generate the missing html, head, and body nodes, and then insert the
remaining elements, ignoring any self-closing flags, as none of the element
names in question result in the self-closing flag being respected.
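
To illustrate the nesting rule, here is a rough Python sketch. It is not
Hubbub's code and does not use Hubbub's API; it relies on Python's standard
html.parser module, and OutlineBuilder, outline, and SAMPLE are names made up
for this example. It assumes that only the void elements listed in the HTML
specification honour "/>", and it leaves out the XML declaration, the
auto-generated html, head, and body nodes, attributes, and foreign
(SVG/MathML) content.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML specification. In this sketch these are
# the only elements that never get children, with or without "/>".
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr",
}

# The input from the original report, minus the XML declaration.
SAMPLE = '<root>\n<a />\n<b x="1"/>\n<d />\n<e />\n</root>\n'


class OutlineBuilder(HTMLParser):
    """Hypothetical helper: records an indented outline of element nesting."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of elements that are still open
        self.lines = []          # one outline line per start tag seen

    def handle_starttag(self, tag, attrs):
        indent = "  " * len(self.open_elements)
        self.lines.append(indent + "<" + tag + ">")
        if tag not in VOID_ELEMENTS:
            # Non-void elements stay open until a matching end tag arrives.
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Ignore the self-closing flag: "<a />" is handled exactly like "<a>".
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close everything up to and including the matching open element;
        # end tags with no matching open element are dropped.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass


def outline(markup):
    """Return the nesting outline for markup as a list of indented lines."""
    builder = OutlineBuilder()
    builder.feed(markup)
    builder.close()
    return builder.lines


if __name__ == "__main__":
    for line in outline(SAMPLE):
        print(line)
```

Running it on the sample prints root, a, b, d, and e each nested one level
deeper than the last, which matches the shape of the tree above from <root>
downwards. Swapping one of the names for br shows the difference: that
element stays a leaf and the following element becomes its sibling.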
J.