>> For both, `XMLDocument::fromEmpty` and `HTMLDocument::createEmpty` there is
>> an argument available to define the encoding but none of the other
>> `createFrom*` methods have this argument.
>>
>> As far as I understand, in the these other cases the encoding gets detected
>> from the content of the passed source but what happens is the source does
>> not contain any information about the encoding?. E.g. you load an XML/HTML
>> document over HTTP, the encoding is defined via HTTP header but the content
>> itself doesn't contain it.
>
> Right, we follow the HTML spec in this regard. Roughly speaking we determine
> the charset in the following order of priorities.
> If one option fails, it will fall through to the next one.
> 1. The Content-Type HTTP header from which you loaded the document.
> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend
>    the content with byte markers. This is used to detect encoding.
> 3. Meta tag in the content.
>
> If it could not be determined at all, UTF-8 will be assumed as it's the
> default in HTML.
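
For reference, here is how I read the fall-through order quoted above, as a rough PHP sketch. `detect_charset()` is a hypothetical helper of my own, not anything in ext/dom, and the `<meta>` regex is only a crude stand-in for the spec's prescan; the ordering simply mirrors the list in the quoted message.

```php
<?php
/**
 * Hypothetical helper (not part of ext/dom): picks a charset label using
 * the priority order listed in the quoted message.
 */
function detect_charset(?string $contentType, string $bytes): string
{
    // 1. charset parameter of the Content-Type HTTP header, if present.
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([^\s";]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. BOM sniffing: UTF-8, then UTF-16 BE / LE byte markers.
    if (str_starts_with($bytes, "\xEF\xBB\xBF")) {
        return 'UTF-8';
    }
    if (str_starts_with($bytes, "\xFE\xFF")) {
        return 'UTF-16BE';
    }
    if (str_starts_with($bytes, "\xFF\xFE")) {
        return 'UTF-16LE';
    }

    // 3. A <meta> charset declaration within the first 1024 bytes
    //    (simplified; the real prescan is considerably more careful).
    $head = substr($bytes, 0, 1024);
    if (preg_match('/<meta[^>]*charset\s*=\s*["\']?([^\s"\'>\/;]+)/i', $head, $m)) {
        return strtoupper($m[1]);
    }

    // 4. Nothing found: assume UTF-8.
    return 'UTF-8';
}
```

Called as `detect_charset('text/html', $body)`, a header without a charset parameter falls through to the BOM check, then to the meta tag, and finally to the UTF-8 default.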
It may sound meticulous, but I’ve tried to emphasize `createFragment()` in what’s being built in WordPress, because almost everything done to HTML within WordPress, and I think within many frameworks, is processing fragments (and usually short ones at that). Formerly I didn’t realize there was much of a difference, but text encoding is one of those differences.

It’s my understanding that when parsing a fragment we have to assume an encoding, unless the fragment starts at a spot in the document before the encoding is discovered, presumably only if we’ve constructed a Document with a still-unknown encoding. So manually setting the encoding in a fragment constructor is not so much overriding as it is supplying, or at least, that’s one of two normative situations. If we create a fragment with a context node that already carries an encoding, then we need to ignore any meta tag that specifies otherwise; likewise, if the context node doesn’t carry an encoding, we do need to heed the meta tag.

I know there’s a huge difference in needs here between people writing scripts to scrape full HTML documents and people working with fragments, but it’s not a small fraction of cases where people want to use DOMDocument without having the full HTML from start to finish. In the world I work in it’s usually either for parsing a small fragment to add some attributes or replace a wrapping tag, or for constructing HTML programmatically to avoid escaping issues and make nesting easy. In both of these cases the text encoding is implicit unless the function signature makes it explicit.

At this stage in development, we only support some of the “in body” parsing and only support UTF-8, but I thought it was important enough to add these as arguments to the creator function so that there’s an awareness that these values govern how the parse occurs.
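
To make the context-node rule above concrete, here is a tiny sketch in PHP. Everything in it is hypothetical: `fragment_encoding()` is a name I made up for illustration, and treating the caller-supplied value as the last fallback is my own assumption rather than anything taken from the spec or from an existing API.

```php
<?php
/**
 * Hypothetical helper: which encoding should a fragment parse use?
 *
 * $contextEncoding  encoding already carried by the context node, or null
 *                   if the document's encoding is still unknown
 * $metaCharset      charset declared by a <meta> tag inside the fragment,
 *                   or null if there is none
 * $supplied         encoding passed explicitly to the fragment constructor
 *                   (assumption: used only when nothing else is known)
 */
function fragment_encoding(
    ?string $contextEncoding,
    ?string $metaCharset,
    string $supplied = 'UTF-8'
): string {
    // The context node already carries an encoding: ignore any meta tag.
    if ($contextEncoding !== null) {
        return $contextEncoding;
    }

    // No encoding on the context node yet: heed the meta tag if present.
    if ($metaCharset !== null) {
        return $metaCharset;
    }

    // Otherwise the caller is supplying the encoding, not overriding one.
    return $supplied;
}
```

So `fragment_encoding('UTF-8', 'ISO-8859-1')` stays with the context node's UTF-8, while `fragment_encoding(null, 'ISO-8859-1')` follows the meta tag.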
Surely for `createFromString()` and `createEmpty()` we can assume that no character encoding is set, but I also suspect that, possibly a majority of the time, people calling these functions are in situations where `createFragment()` is more appropriate and aren’t supplying HTML documents with in-band text encoding information. So there’s a chance that de-emphasizing the parameter may be technically more accurate and practically less helpful.

Love seeing all the continued work on this! Thank you so much for your dedication to it.

Dennis Snell

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php