Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support

Dennis Snell via internals Fri, 29 Sep 2023 14:39:15 -0700

> Just chiming in here to say that while we don't offer a createFragment() in 
> this proposal, it's possible to parse fragments by passing the 
> LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer 
> innerHTML which you could use then in conjunction with 
> createDocumentFragment().



My understanding is that this isn’t quite right, because fragment parsing 
involves more than whether the HTML and BODY elements are implied.


> Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of 
> implied html/body... elements.


The HTML5 spec defines fragment parsing as starting within a context node which 
exists within a broader document. For example, many people will parse a string 
of HTML that should form the contents of an LI element. They are grabbing that 
HTML from a database somewhere, or from user input. If that HTML contains “</li>” 
then the behavior diverges. In a fragment parser it would close out the list we 
started with, but in full document parsing mode the end tag would be ignored as a 
parse error. If the goal is to ensure that user input doesn’t break out and 
change the page, then it’s important to use fragment parsing and grab the inner 
contents of that LI context node.
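
The break-out risk described above can be sketched even without a fragment parser. This is a minimal sketch, assuming the proposed API as it later shipped in PHP 8.4 (`Dom\HTMLDocument::createFromString()`); `count_list_items()` and the sample inputs are hypothetical and only for illustration. It wraps untrusted input in a list by string concatenation, parses the result as a full document, and counts the LI elements that come out.

```php
<?php
// Assumes PHP 8.4+, where the proposed parser is available as Dom\HTMLDocument.

/**
 * Hypothetical helper: wrap $item_html in a one-item list by concatenation,
 * parse the whole string as a document, and count the resulting LI elements.
 */
function count_list_items( string $item_html ): int {
    $html = '<!DOCTYPE html><ul><li>' . $item_html . '</li></ul>';

    // LIBXML_NOERROR silences parse-error warnings for this demonstration.
    $doc = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR );

    return $doc->getElementsByTagName( 'li' )->length;
}

// Well-behaved input stays inside the single list item.
$safe = count_list_items( 'one <em>item</em>' );

// Input containing "</li>" closes the item early and opens a second one.
$escaped = count_list_items( 'one</li><li>two' );

printf( "safe: %d, escaped: %d\n", $safe, $escaped );
```

With a spec-compliant parser the first call should report one list item and the second two, which is the kind of structural change a caller usually wants to rule out when inserting untrusted markup.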


This can be valuable to have as a tool to guard against injection attacks or 
against accidentally breaking the page someone is building, because the 
fragment parser is aware of its environment. It becomes even more important 
when parsing within RCDATA or RAWTEXT sections. For example, when parsing a web 
page’s title to analyze or manipulate it, the parser should treat everything as 
plaintext until it reaches the end of the input or encounters a closing TITLE 
tag. When doing this with `createFromString()`, it’s up to the caller to 
remember to prepend and then remove the environment: 
`createFromString( '<title>' . $page_title . '</title>' )`. The fragment parser 
would be similar in practice, but more explicit and harder to misunderstand in 
these circumstances.
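
The wrap-and-unwrap workaround described above might look like the following. It is a sketch only, assuming the API as shipped in PHP 8.4 (`Dom\HTMLDocument`); `parse_title_text()` is a hypothetical helper and not part of the proposal.

```php
<?php
// Assumes PHP 8.4+, where the proposed parser is available as Dom\HTMLDocument.

/**
 * Hypothetical helper: parse $page_title as the contents of a TITLE element
 * by supplying the environment manually, then removing it again.
 */
function parse_title_text( string $page_title ): string {
    // Wrap the input in a TITLE element so the tokenizer applies RCDATA rules.
    $html = '<!DOCTYPE html><title>' . $page_title . '</title>';

    // LIBXML_NOERROR silences parse-error warnings for this demonstration.
    $doc = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR );

    // Remove the environment again: keep only the text of the first TITLE.
    $title = $doc->getElementsByTagName( 'title' )->item( 0 );

    return null === $title ? '' : $title->textContent;
}

// Tags inside a TITLE are plain text, not elements.
$plain = parse_title_text( 'A <b>bold</b> claim' );

// A closing TITLE tag in the input still ends the element early.
$cut = parse_title_text( 'Home</title><p>stray' );

var_dump( $plain, $cut );
```

Both calls go through the full-document parser, so the caller has to remember the wrapping and the unwrapping. A fragment API that takes a context element would make that environment explicit.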


This is complicated stuff. I understand that the spec provides for a wide 
variety of use-cases and needs, and that it’s hard to pin down exactly what a 
spec-compliant parser is supposed to do in all situations (it depends), so I 
only want to share the perspective of people doing a lot of small HTML 
manipulation. There’s not much code out there using the fragment parser, but I 
can’t help but think that part of the reason is that it’s not exposed where 
it ought to be.


Have a great weekend!
Dennis Snell