Hi Dennis

On 9/29/23 20:20, Dennis Snell wrote:
>>
>>>
>>> For both, `XMLDocument::fromEmpty` and `HTMLDocument::createEmpty` there is
>>> an argument available to define the encoding but none of the other
>>> `createFrom*` methods have this argument.
>>>
>>> As far as I understand, in these other cases the encoding gets detected
>>> from the content of the passed source but what happens if the source does
>>> not contain any information about the encoding? E.g. you load an XML/HTML
>>> document over HTTP, the encoding is defined via HTTP header but the content
>>> itself doesn't contain it.
>>>
>>
>> Right, we follow the HTML spec in this regard. Roughly speaking we determine
>> the charset in the following order of priorities.
>> If one option fails, it will fall through to the next one.
>> 1. The Content-Type HTTP header from which you loaded the document.
>> 2. BOM sniffing in the content. I.e. UTF-8 with BOM and UTF-16 LE/BE prepend
>>    the content with byte markers. This is used to detect encoding.
>> 3. Meta tag in the content.
>>
>> If it could not be determined at all, UTF-8 will be assumed as it's the
>> default in HTML.
>
> It may sound meticulous, but I’ve tried to emphasize `createFragment()` in
> what’s being built in WordPress because almost everything being done on HTML
> within WordPress, and I think within many frameworks, is processing fragments
> (and usually short ones at that). Formerly I didn’t realize there was much of
> a difference, but text encoding is one of those differences. It’s my
> understanding that when parsing a fragment we have to assume an encoding,
> unless the fragment is starting at a spot in the document before that’s
> discovered, presumably only if we’ve constructed a Document with a
> still-unknown encoding.
>
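
For readers following along, the fall-through order quoted above (Content-Type header, then BOM, then meta tag, then the UTF-8 default) can be sketched roughly as follows. This is only an illustration in Python, not the extension's actual code: the function names `sniff_encoding` and `charset_from_content_type` are made up for this sketch, and the meta tag scan is a simplified regular expression rather than the prescan algorithm the HTML spec describes.

```python
import codecs
import re
from email.message import Message

# BOMs named in the quoted text: UTF-8 with BOM and UTF-16 LE/BE.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Simplified assumption: find "charset=..." inside a <meta ...> tag.
# A real HTML parser follows a much more careful prescan.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_content_type(header):
    """Return the charset parameter of a Content-Type header, or None."""
    if not header:
        return None
    msg = Message()
    msg["Content-Type"] = header
    return msg.get_content_charset()  # lowercased charset, or None


def sniff_encoding(body, content_type=None):
    """Hypothetical helper: pick an encoding in the order quoted above."""
    # 1. The Content-Type HTTP header, if it carries a charset.
    charset = charset_from_content_type(content_type)
    if charset:
        return charset
    # 2. BOM sniffing at the very start of the content.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 3. Meta tag near the start of the content.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Nothing found: assume UTF-8, the HTML default.
    return "utf-8"
```

Each step returns as soon as it finds an answer, so a later source is only consulted when every earlier one came up empty.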
Just chiming in here to say that while we don't offer a createFragment() in
this proposal, it's possible to parse fragments by passing the
LIBXML_HTML_NOIMPLIED option.
Alternatively, in the future I plan to offer innerHTML which you could then
use in conjunction with createDocumentFragment().

> So manually setting the encoding of a fragment constructor is not so much
> overriding as it is supplying, or at least, that’s one of two normative
> situations. If we create a fragment with a context node carrying an encoding
> already, then we need to ignore any meta tag that specifies otherwise;
> likewise if the context node doesn’t carry that encoding we do need to heed
> it.

Sure, I agree it's not overriding in that specific case. In other cases it can
be. There may not be an ideal name that works for all cases.

>
> I know there’s a huge difference in needs here between people writing scripts
> to scrape full HTML documents, but it’s not a small fraction of cases where
> people want to use DOMDocument without having the full HTML from start to
> finish. In the world I work in it’s usually either for parsing a small
> fragment to add some attributes or replace a wrapping tag, or for
> constructing HTML programmatically to avoid escaping issues and make nesting
> easy. In both of these cases the text encoding is implicit unless the
> function signature makes it explicit. At this stage in development, we only
> support some of the “in body” parsing and only support UTF-8, but I thought
> that it was important enough to add these as arguments to the creator
> function so that there’s an awareness that these values govern how the parse
> occurs.
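
To make the fragment rule discussed above concrete: an encoding already carried by the context node wins over a meta tag, a meta tag is heeded only when the context carries none, and UTF-8 remains the fallback. A minimal sketch, assuming the meta charset has already been extracted by the caller; `resolve_fragment_encoding` and its parameters are hypothetical names for this illustration, not part of the proposal.

```python
from typing import Optional


def resolve_fragment_encoding(
    context_encoding: Optional[str],
    meta_charset: Optional[str],
    default: str = "utf-8",
) -> str:
    """Hypothetical helper: choose the encoding used to parse a fragment."""
    # The context node already carries an encoding: ignore any meta tag.
    if context_encoding:
        return context_encoding.lower()
    # No encoding on the context yet: heed the meta tag if there is one.
    if meta_charset:
        return meta_charset.lower()
    # Neither is known: fall back to the assumed default.
    return default
```

Seen this way, an encoding argument on a fragment constructor supplies the first input rather than overriding anything, which is the distinction being drawn here.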
>
> Surely for `createFromString()` and `createEmpty()` we can make the
> assumption that no character encoding is set, but I also suspect that a
> possible majority of the times people use these functions they are likely
> calling them when `createFragment()` is more appropriate, that they aren’t
> supplying HTML documents with in-band text encoding information, and so
> there’s a chance that de-emphasizing the parameter may be technically more
> accurate and practically less helpful.

Thanks for the insight.
To be honest, I don't have strong feelings about the naming of the parameter.
In a way you could say that override_encoding is still accurate, because the
fallback default is UTF-8, so you override the fallback in a sense.
As the documentation also emphasizes that the DOM extension works internally
with UTF-8, this may align with expectations of programmers, but I'm not sure.
I think we should really document the parameter very well in the docs.

>
> Love seeing all the continued work on this!
> Thank you so much for your dedication to it.
>
> Dennis Snell

Kind regards
Niels

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php