On Tue, Oct 13, 2009 at 01:22:12PM +0100, Martin (gzlist) wrote:
> On 13/10/2009, Stefan Behnel <[email protected]> wrote:
> >
> > I wonder why the parser stops parsing here, though. Is '\0' explicitly
> > considered an invalid character in (broken) HTML, or is it really just the
> > usual C EOS slip?
>
> It's certainly invalid, though could be recoverable.
>
> In the various html versions: HTML 4 defers to the SGML spec which I'm
> not rich enough to consult, XHTML 1 defers to XML which we all know
> says nulls are verboten, and the current HTML 5 draft is pretty clear:
>
> <http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream>
>
> "All U+0000 NULL characters in the input must be replaced by U+FFFD
> REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
> error."
>
> (this is all in the context of an decoded-to-unicode stream, not raw
> UTF-16 etc.)

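For illustration only, here is a minimal Python sketch of the two
behaviours mentioned in the quoted text: the "C EOS slip", where code
reading a NUL-terminated char* stops at the first zero byte, and the
replacement rule from the HTML 5 draft. The names replace_nulls and
as_c_string are hypothetical, and ctypes is used only to mimic a C
function reading a NUL-terminated string. This is not what libxml2 or
lxml actually does internally.

```python
import ctypes

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_nulls(text):
    """Apply the quoted HTML 5 draft rule to an already-decoded string.

    Returns the cleaned text and the number of U+0000 characters found;
    the draft treats each occurrence as a parse error.
    (Hypothetical helper, not part of any library.)
    """
    return text.replace("\x00", REPLACEMENT), text.count("\x00")


def as_c_string(data):
    """Return the bytes a C function reading a NUL-terminated char* would see.

    (Hypothetical helper; c_char_p.value stops at the first zero byte.)
    """
    return ctypes.c_char_p(data).value


html = "<p>before\x00after</p>"

# Without preprocessing, everything after the NUL is invisible to C code.
seen_by_c = as_c_string(html.encode("utf-8"))

# With the draft's rule applied first, the whole document survives.
cleaned, errors = replace_nulls(html)
seen_after_cleaning = as_c_string(cleaned.encode("utf-8"))
```

Doing the replacement on the decoded string, before anything is handed
to C code, matches the remark above that the rule applies to a
decoded-to-unicode stream rather than to raw UTF-16 bytes.
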
  Once HTML5 becomes a Last Call draft or similar, I think it will
make sense to update the parser to use the same recovery tricks.
  Note that the 0 in the content may already have cut the input at the
Python->C interface layer. In any case, libxml2 internals certainly
don't like 0 in content.

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
[email protected]  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
