Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

Niels Dossche Thu, 21 Sep 2023 13:38:30 -0700

Hi Dennis

On 21/09/2023 22:26, Dennis Snell wrote:
> 
> 
>> On Sep 19, 2023, at 12:30 AM, Tim Düsterhus <t...@bastelstu.be> wrote:
>>
>> From the perspective of the user of the API, I like the symmetry between all 
>> the named constructors:
>>
>> Whenever I want to create a new document, I use one of the fromXyz() 
>> methods. And when I use those, I get exactly what it says on the tin.
>>
>> Making the regular constructor available for use would effectively make 
>> whatever that constructor does a special case / the default case.
> 
> 
> Just wanted to voice my support for this too. In the WordPress HTML API I 
> regret exposing a public constructor on the underlying lexer because that 
> prevented me from hiding the constructor on the HTML parsing class, leading 
> to an awkward convention to try and beg people not to use it.
> 
> Why? the constructor performs various state initialization important for 
> HTML, not only for the encoding, but in our case I felt it was important 
> enough that we provide both a full-document interface and also the HTML 
> fragment parser, since almost all of the actual HTML parsing we do is one 
> fragments. This _could_ be something to consider here in the HTMLDocument API 
> as well.
> 
> The use of the constructor directly opens up opportunities to bypass that 
> state initialization which could lead to incorrect parse results, not only 
> for encoding issues, but also in getting way off track in parsing. E.g. if 
> we’re parsing the inner contents of a TEMPLATE element then a closing BODY 
> tag should be ignored whereas in other parsing modes it closes the BODY 
> element.
> 
>> For the HtmlDocument I find it less obvious that creating an empty document 
>> would be the default case compared to loading an existing document from a 
>> file.
> 
> 
> To hit the point home, “creating an empty document” seems a little unclear in 
> terms of HTML as well. Are we creating am empty DOM class instance without 
> any actual DOM? Are we creating a <!DOCTYPE html> DOM with no nodes? If we 
> proceed by feeding it an HTML string and parsing then we can resolve the 
> ambiguity, but if we start inserting DOM nodes directly that leaves us in an 
> undefined starting point.
> 
> Presumably we don’t need this empty method because we could `loadString( ‘’ 
> )`, but that leaves open the ambiguities of what kind of empty we’re making. 
> In my experience with HTML, any kind of “empty” is going to need a definition 
> of its own, and a static creator method is the right place to define and 
> parameterize that, as in the updated RFC, though maybe with an additional 
> clarification of what “empty” means and what implications it carries for 
> ongoing manipulation.
>


To clarify what createEmpty() does and how it differs from createFromString(''):
createEmpty() creates a HTMLDocument whose DOM tree is empty (i.e. no nodes): 
no html element and not even a doctype. It is the replacement for the 
DOMDocument constructor. This is intended for people who create everything 
themselves from scratch, or people trying to construct only fragments. This 
does not invoke the parser and its rules.
createFromString('') will not produce an empty DOM tree. That's because the 
parsing rules tell us that the html, head and body element are always created. 
This means the resulting DOM document will look like this: 
<html><head></head><body></body></html>.

> —
> 
> Having static creator methods with docblocks and named arguments to me is a 
> great way to not only guide people to use these classes safely, but also to 
> teach some of the nuances that have historically been overlooked in PHP’s 
> HTML handling, e.g. `html_entity_decode()` is unaware of of the ambiguous 
> ampersand and offers no way to correctly parse certain kinds of invalid named 
> character references.
> 

I agree.

> Kindly,
> Dennis Snell

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php

Re: [PHP-DEV] [RFC] [Discussion] DOM HTML5 parsing and serialization support

Reply via email to