Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support

Niels Dossche Fri, 29 Sep 2023 14:18:19 -0700

Hi Dennis

On 9/29/23 20:20, Dennis Snell wrote:
>>
>>>
>>> For both, `XMLDocument::fromEmpty` and `HTMLDocument::createEmpty` there is 
>>> an argument available to define the encoding but none of the other 
>>> `createFrom*` methods have this argument.
>>>
>>> As far as I understand, in these other cases the encoding gets detected 
>>> from the content of the passed source, but what happens if the source does 
>>> not contain any information about the encoding? E.g. you load an XML/HTML 
>>> document over HTTP, the encoding is defined via HTTP header but the content 
>>> itself doesn't contain it.
>>>
>>
>> Right, we follow the HTML spec in this regard. Roughly speaking we determine 
>> the charset in the following order of priorities.
>> If one option fails, it will fall through to the next one.
>> 1. The Content-Type HTTP header from which you loaded the document.
>> 2. BOM sniffing in the content, i.e. UTF-8 with BOM and UTF-16 LE/BE prefix 
>> the content with a byte order mark, which is used to detect the encoding.
>> 3. Meta tag in the content.
>>
>> If it could not be determined at all, UTF-8 will be assumed as it's the 
>> default in HTML.
> 
> It may sound meticulous, but I’ve tried to emphasize `createFragment()` in 
> what’s being built in WordPress because almost everything being done on HTML 
> within WordPress, and I think within many frameworks, is processing fragments 
> (and usually short ones at that). Formerly I didn’t realize there was much of 
> a difference, but text encoding is one of those differences. It’s my 
> understanding that when parsing a fragment we have to assume an encoding, 
> unless the fragment is starting at a spot in the document before that’s 
> discovered, presumably only if we’ve constructed a Document with a 
> still-unknown encoding.
>


Just chiming in here to say that while we don't offer a createFragment() in 
this proposal, it's possible to parse fragments by passing the 
LIBXML_HTML_NOIMPLIED option. Alternatively, in the future I plan to offer 
innerHTML, which you could then use in conjunction with 
createDocumentFragment().
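
As a rough sketch of the LIBXML_HTML_NOIMPLIED route, assuming the API lands 
as currently proposed in the RFC (a `DOM\HTMLDocument::createFromString()` 
taking the source and an options bitmask), parsing and tweaking a small 
fragment could look like the following. The fragment contents and the `intro` 
class name are made up for illustration.

```php
<?php
// Assumption: the RFC's proposed DOM\HTMLDocument class is available.
$fragment = '<p>Hello <b>world</b></p>';

// LIBXML_HTML_NOIMPLIED: do not add the implied <html>, <head>, <body> elements.
// LIBXML_NOERROR: silence parse-error warnings such as the missing doctype.
$doc = DOM\HTMLDocument::createFromString(
    $fragment,
    LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR
);

// Typical fragment use case: add an attribute to an existing element.
$p = $doc->getElementsByTagName('p')->item(0);
$p->setAttribute('class', 'intro');

// Serialize the document back to an HTML string.
$html = $doc->saveHTML();
echo $html, "\n";
```

Note that this still parses the input as a document, just without the implied 
elements, so it is not a substitute for context-aware fragment parsing.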

> So manually setting the encoding of a fragment constructor is not so much 
> overriding as it is supplying, or at least, that’s one of two normative 
> situations. If we create a fragment with a context node carrying an encoding 
> already, then we need to ignore any meta tag that specifies otherwise; 
> likewise if the context node doesn’t carry that encoding we do need to heed 
> it.

Sure, I agree it's not overriding in that specific case. In other cases it can 
be.
There may not be an ideal naming that works for all cases.

> 
> I know there’s a huge difference in needs here between people writing scripts 
> to scrape full HTML documents, but it’s not a small fraction of cases where 
> people want to use DOMDocument without having the full HTML from start to 
> finish. In the world I work in it’s usually either for parsing a small 
> fragment to add some attributes or replace a wrapping tag, or for 
> constructing HTML programmatically to avoid escaping issues and make nesting 
> easy. In both of these cases the text encoding is implicit unless the 
> function signature makes it explicit. At this stage in development, we only 
> support some of the “in body” parsing and only support UTF-8, but I thought 
> that it was important enough to add these as arguments to the creator 
> function so that there’s an awareness that these values govern how the parse 
> occurs.
> 
> Surely for `createFromString()` and `createEmpty()` we can make the 
> assumption that no character encoding is set, but I also suspect that a 
> possible majority of the times people use these functions they are likely 
> calling them when `createFragment()` is more appropriate, that they aren’t 
> supplying HTML documents with in-band text encoding information, and so 
> there’s a chance that de-emphasizing the parameter may be technically more 
> accurate and practically less helpful.

Thanks for the insight.
To be honest, I don't have strong feelings about the naming of the parameter.
In a way you could say that override_encoding is still accurate, because the 
fallback default is UTF-8, so you override the fallback in a sense. As the 
documentation also emphasizes that the DOM extension works internally with 
UTF-8, this may align with expectations of programmers, but I'm not sure.
I think we should really document the parameter very well in the docs.
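
To illustrate what such documentation would need to spell out, here is a 
simplified sketch of the priority order I listed earlier in this thread, with 
an explicit override taking precedence over everything. `guessEncoding()` is a 
hypothetical userland helper, not part of ext/dom, and the meta tag scan is a 
crude regex rather than the spec's prescan algorithm.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: pick an encoding label for an HTML byte string.
 * Order: explicit override, Content-Type header, BOM, meta tag, UTF-8 default.
 */
function guessEncoding(string $bytes, ?string $contentType = null, ?string $override = null): string
{
    // An explicitly supplied encoding wins over anything detected.
    if ($override !== null) {
        return $override;
    }

    // 1. charset parameter of the Content-Type HTTP header.
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return $m[1];
    }

    // 2. Byte order mark at the very start of the content.
    if (str_starts_with($bytes, "\xEF\xBB\xBF")) {
        return 'UTF-8';
    }
    if (str_starts_with($bytes, "\xFF\xFE")) {
        return 'UTF-16LE';
    }
    if (str_starts_with($bytes, "\xFE\xFF")) {
        return 'UTF-16BE';
    }

    // 3. Meta tag near the start of the content (simplified scan of 1024 bytes).
    $head = substr($bytes, 0, 1024);
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)/i', $head, $m)) {
        return $m[1];
    }

    // Nothing found: fall back to UTF-8.
    return 'UTF-8';
}

// A fragment without any in-band information falls through to the default.
echo guessEncoding('<p>hi</p>'), "\n";
```

For a short fragment with no header, BOM, or meta tag, the caller-supplied 
value is effectively the only real source of encoding information, which is 
the case Dennis describes.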

> 
> Love seeing all the continued work on this!
> Thank you so much for your dedication to it.
> 
> Dennis Snell

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
