Re: [xml] html parsing incomplete - bug?

Stefan Behnel Tue, 13 Oct 2009 05:53:29 -0700

Daniel Veillard wrote:
> On Tue, Oct 13, 2009 at 01:22:12PM +0100, Martin (gzlist) wrote:
>> On 13/10/2009, Stefan Behnel wrote:
>>> I wonder why the parser stops parsing here, though. Is '\0' explicitly
>>> considered an invalid character in (broken) HTML, or is it really just the
>>> usual C EOS slip?
>> It's certainly invalid, though could be recoverable.
>>
>> In the various html versions: HTML 4 defers to the SGML spec which I'm
>> not rich enough to consult, XHTML 1 defers to XML which we all know
>> says nulls are verboten, and the current HTML 5 draft is pretty clear:
>>
>> <http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>
>>
>> "All U+0000 NULL characters in the input must be replaced by U+FFFD
>> REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
>> error."
>>
>> (this is all in the context of a decoded-to-Unicode stream, not raw
>> UTF-16 etc.)
> 
>   When HTML5 becomes a Last Call draft or something, then I think it
> will make sense to try to update the parser to use the same recovery
> tricks.


In any case, the parser should either apply the above replacement rule or
report an error when encountering a '\0' byte in the input stream.
Currently, it just silently terminates.
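
As a rough illustration of the two options, here is a minimal Python sketch
that works on already-decoded text, as the quoted HTML 5 draft rule assumes.
The function name and the `strict` flag are hypothetical and made up for this
example; this is not libxml2 or lxml API, only the behaviour I would expect
from either variant.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def preprocess_html_text(text, strict=False):
    """Handle U+0000 in decoded HTML text (hypothetical helper).

    Returns the cleaned text plus the offsets where a NULL character was
    found, so the caller can report each one as a parse error. With
    strict=True the first NULL raises instead of being replaced.
    """
    positions = [i for i, ch in enumerate(text) if ch == "\x00"]
    if strict and positions:
        raise ValueError("NULL character at offset %d" % positions[0])
    return text.replace("\x00", REPLACEMENT), positions


cleaned, errors = preprocess_html_text("<p>a\x00b</p>")
```

Either way, the caller learns that something was wrong with the input,
instead of getting a truncated tree without any warning.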


> Note that the 0 in content may have cut the input at the Python->C
> interface layer. But sure libxml2 internals don't like 0 in content.

We also pass UCS4 encoded data through the same code, so, no, that's not an
issue here.
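
To spell out the reasoning (a sketch under my own assumptions, not code from
lxml or libxml2): UCS4 encoded markup is full of zero bytes, so an interface
that treated the buffer as a zero-terminated C string would cut off every
such document right away. The variable names below are only for illustration.

```python
text = "<p>hi</p>"

# UCS4 (UTF-32, little endian, no BOM): four bytes per character.
ucs4 = text.encode("utf-32-le")

# Every ASCII character contributes three zero bytes.
zero_bytes = ucs4.count(b"\x00")

# What a zero-terminated C string view of the same buffer would see.
c_string_view = ucs4.split(b"\x00", 1)[0]
```

Since such input does get through, the data must be passed with an explicit
length rather than relying on a terminating zero byte.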

Stefan
