Re: [PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support

Dennis Snell via internals Fri, 29 Sep 2023 11:20:55 -0700

> 
>> 
>> For both, `XMLDocument::fromEmpty` and `HTMLDocument::createEmpty` there is 
>> an argument available to define the encoding but none of the other 
>> `createFrom*` methods have this argument.
>> 
>> As far as I understand, in these other cases the encoding gets detected 
>> from the content of the passed source, but what happens if the source does 
>> not contain any information about the encoding? E.g. you load an XML/HTML 
>> document over HTTP, the encoding is defined via HTTP header but the content 
>> itself doesn't contain it.
>> 
> 
> Right, we follow the HTML spec in this regard. Roughly speaking we determine 
> the charset in the following order of priorities.
> If one option fails, it will fall through to the next one.
> 1. The Content-Type HTTP header from which you loaded the document.
> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend 
> the content with byte markers. This is used to detect encoding.
> 3. Meta tag in the content.
> 
> If it could not be determined at all, UTF-8 will be assumed as it's the 
> default in HTML.
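
For illustration, the fall-through order quoted above could be sketched roughly as follows. This is only a simplified sketch under stated assumptions, not the RFC's implementation: `guessHtmlEncoding()` is a hypothetical helper, the 1024-byte window and the regular expressions are stand-ins for the spec's real prescan, and returned labels are merely upper-cased rather than normalized.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ext/dom or the RFC): guesses the encoding
 * of an HTML byte string in the priority order quoted above.
 */
function guessHtmlEncoding(string $bytes, ?string $contentType = null): string
{
    // 1. charset parameter of the Content-Type HTTP header, when one is known.
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m) === 1) {
        return strtoupper($m[1]);
    }

    // 2. Byte order mark at the very start of the content.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }

    // 3. A meta tag near the start of the content (assumed 1024-byte window;
    //    a real parser's prescan handles many more edge cases).
    $head = substr($bytes, 0, 1024);
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([^\s"\'\/>;]+)/i', $head, $m) === 1) {
        return strtoupper($m[1]);
    }

    // 4. Nothing found: assume UTF-8.
    return 'UTF-8';
}
```

With no header, no BOM, and no meta tag, as is typical for a short fragment, every step falls through and the helper returns `UTF-8`.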


It may sound meticulous, but I’ve tried to emphasize `createFragment()` in 
what’s being built in WordPress because almost everything being done on HTML 
within WordPress, and I think within many frameworks, is processing fragments 
(and usually short ones at that). Formerly I didn’t realize there was much of a 
difference, but text encoding is one of those differences. It’s my 
understanding that when parsing a fragment we have to assume an encoding, 
unless the fragment starts at a point in the document before the encoding has 
been discovered, which presumably happens only if we’ve constructed a Document 
with a still-unknown encoding.

So manually setting the encoding in a fragment constructor is not so much 
overriding as it is supplying; at least, that’s one of two normative 
situations. If we create a fragment with a context node that already carries an 
encoding, then we need to ignore any meta tag that specifies otherwise; 
conversely, if the context node doesn’t carry an encoding, we do need to heed 
the meta tag.
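
To make the two situations concrete, here is a minimal sketch of that rule. Everything in it is hypothetical: `resolveFragmentEncoding()` is not an existing API, and the UTF-8 fallback for the case where neither source supplies an encoding is my own assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: picks the encoding to use when parsing a fragment.
 *
 * @param string|null $contextEncoding Encoding carried by the context node, if any.
 * @param string|null $metaEncoding    Encoding declared by a meta tag in the fragment, if any.
 * @param string      $fallback        Assumed default when neither is available.
 */
function resolveFragmentEncoding(
    ?string $contextEncoding,
    ?string $metaEncoding,
    string $fallback = 'UTF-8'
): string {
    // The context node already carries an encoding: ignore any meta tag.
    if ($contextEncoding !== null) {
        return $contextEncoding;
    }

    // Otherwise heed the meta tag, or fall back to the assumed default.
    return $metaEncoding ?? $fallback;
}
```

For example, `resolveFragmentEncoding('UTF-8', 'ISO-8859-1')` keeps the context node's `UTF-8`, while `resolveFragmentEncoding(null, 'ISO-8859-1')` heeds the meta tag.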

I know there’s a huge difference in needs between people writing scripts to 
scrape full HTML documents and people working with fragments, but it’s not a 
small fraction of cases where people want to use DOMDocument without having the 
full HTML from start to finish. In the world I work in it’s usually either for 
parsing a small fragment 
to add some attributes or replace a wrapping tag, or for constructing HTML 
programmatically to avoid escaping issues and make nesting easy. In both of 
these cases the text encoding is implicit unless the function signature makes 
it explicit. At this stage in development, we only support some of the “in 
body” parsing and only support UTF-8, but I thought that it was important 
enough to add these as arguments to the creator function so that there’s an 
awareness that these values govern how the parse occurs.

Surely for `createFromString()` and `createEmpty()` we can make the assumption 
that no character encoding is set, but I also suspect that, possibly a majority 
of the time, people call these functions when `createFragment()` would be more 
appropriate and aren’t supplying HTML documents with in-band text encoding 
information. So there’s a chance that de-emphasizing the parameter may be 
technically more accurate but practically less helpful.

Love seeing all the continued work on this!
Thank you so much for your dedication to it.

Dennis Snell
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
