[PHP-DEV] Re: [RFC] [Discussion] DOM HTML5 parsing and serialization support

Niels Dossche Wed, 06 Sep 2023 13:26:46 -0700

Hi Dennis

On 06/09/2023 22:02, Dennis Snell wrote:
> 
> 
>> On Sep 4, 2023, at 1:15 PM, Niels Dossche <dossche.ni...@gmail.com> wrote:
>>
>> On 04/09/2023 21:54, Dennis Snell wrote:
>>> Thanks for the proposal Niels,
>>>
>>> I’ve dealt with my own grief working through issues in DOMDocument and 
>>> wanting it to work but finding it inadequate.
>>>
>>>> HTML5
>>>
>>> This would be a great starting point; I would love it if we took the 
>>> opportunity to fix named character reference decoding, as PHP has (to my 
>>> knowledge) never respected (at least in HTML5) that they decode differently 
>>> inside attributes than they do inside markup, considering rules such as the 
>>> ambiguous ampersand and decode errors.
>>>
>>> It’s also been frustrating that DOMDocument parses tags in RCDATA sections 
>>> where they don’t exist, such as in TITLE or TEXTAREA elements, escapes 
>>> certain types of invalid comments so that they appear rendered in the saved 
>>> document, and misses basic semantic rules (e.g. creating a BUTTON element 
>>> as a child of a BUTTON element instead of closing out the already-open 
>>> BUTTON).
>>
>> With this proposal, which adds a real HTML5 parser, the problems mentioned 
>> above will fortunately be a thing of the past :)
> 
> Awesome. Makes me happy as long as we’re looking at a wholesale replacement 
> of the foundations upon which `DOMDocument` is built. My comment was mostly 
> to point out that there are levels to the inadequacy of `DOMDocument`; or 
> phrased differently, I support diverging from the `DOMDocument` class and 
> parser and even the interface. Making a break from the expectations of the 
> existing one could be nice to signal that it’s different, though I see that 
> full backwards compatibility is important to you.
>


I had a productive discussion with Tim this week about diverging from the 
interface to offer a nicer API.
It's quite likely that I'll propose a change to the RFC later this week to make 
the API a bit cleaner, which will also signal that it's different. The 
backwards compatibility aspect will largely remain, however. In particular, 
the requirement that it be opt-in is important to me.

>>
>>>
>>> I’d like to share what a few of us have been working on inside 
>>> WordPress, which is to build a conformant streaming HTML5 parser:
>>>  - https://developer.wordpress.org/reference/classes/wp_html_tag_processor/
>>>  - https://make.wordpress.org/core/2023/08/19/progress-report-html-api/
>>>
>>> It’s just food for thought right now because adding HTML5 support to 
>>> DOMDocument would benefit everyone, but we decided we had a common need in 
>>> PHP to work with HTML not in a DOM, but in a streaming fashion, one with 
>>> very little runtime overhead. My long-term plan has been to get a good 
>>> grasp of the interface needs and thoroughly test it within the WordPress 
>>> community and then propose its inclusion into PHP. It’s been incredibly 
>>> handy so far, and on my laptop runs at around 20 MB/s, which is not great, 
>>> but good enough for many needs. My naive C port runs on the same laptop at 
>>> around 80 MB/s and I believe that we can likely triple or quadruple that 
>>> speed again if any of us working on it knew how to take advantage of SIMD 
>>> intrinsics.
>>>
>>> It tries to accomplish a few goals:
>>>  - be fast enough
>>>  - interpret HTML as an HTML5-compliant browser will
>>>  - find specific locations within an HTML document and then read or modify 
>>> them
>>>  - pass through any invalid HTML it encounters for the browser to 
>>> resolve/fix unless modifying the part of the document containing those 
>>> invalid constructions
>>>
>>
>> I've seen someone link this on Reddit today, it's a really nice project!
>> It reminds me of Cloudflare's lol-html, which is also a streaming parser 
>> used to modify and sanitize documents linearly.
>> I believe this could be a great addition; it solves a different problem than 
>> the ext/dom extension solves. So I think it would be a great complementary 
>> addition.
> 
> Unfortunately we only found the Cloudflare project after building our “Tag 
> Processor” but the similarities are striking. Having this kind of interface 
> inside PHP would do wonders for the WordPress world, and I think it would be 
> great for many other projects.
> 

Absolutely!

>>
>>> I only bring up this different interface because once we started digging 
>>> deep into DOMDocument we found that the problems with it were far from 
>>> superficial; that there is a host of problems and a mismatched interface to 
>>> our common needs. It has surprised me that PHP, the language of the web, 
>>> has had such trouble handling HTML, the language of the web, and we wanted 
>>> to completely resolve this issue once and for all within WordPress so we 
>>> can clean up decades-old problems with encoding, decoding, security, and 
>>> sanitization.
>>
>> Yes, I was also quite surprised by the lack of support for modern web 
>> features, and also the problems with spec compliance.
>> I only recently got into maintaining ext/dom. So there's still a lot of work 
>> to do.
>> I had already started with adding more DOM APIs in the 8.3 release cycle and 
>> plan to continue that effort in 8.4.
>> Another major project I want to do for 8.4, besides HTML5 support, is fixing 
>> the spec compliance issues in an opt-in manner. This would help with 
>> security & sanitization problems (HTML5 should help with the 
>> encoding & decoding).
>>
>>>
>>> Warmly,
>>> Dennis Snell
>>
>> Kind regards
>> Niels
>>
>>>
>>>> On Sep 2, 2023, at 12:41 PM, Niels Dossche <dossche.ni...@gmail.com> wrote:
>>>>
>>>> I'm opening the discussion for my RFC "DOM HTML5 parsing and serialization 
>>>> support".
>>>> https://wiki.php.net/rfc/domdocument_html5_parser
>>>>
>>>> Kind regards
>>>> Niels
>>
> 
> Impressive proposal. It will be nice to have. Did you consider any tricks for 
> text encoding, such as converting non-UTF-8 documents into UTF-8 before 
> parsing? I was wondering whether, if we did that, we could lean on `iconv` 
> and save the extra data in the library, if that’s important enough.

The new class performs BOM sniffing and encoding prescanning. Prescanning means 
that, if there is no BOM, the charset declared in a meta element is used to 
determine the encoding. The internal representation of the document is UTF-8, 
so a conversion may be needed when the original encoding is not UTF-8. When you 
use the save methods, the data is returned in the original encoding (i.e. 
converted back from UTF-8 to whatever the original encoding was).
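
To illustrate the order of those two steps, here is a rough sketch in Python. 
It is not the code used by the extension or by Lexbor: the function name 
sniff_encoding, the regular expression, the 1024-byte prescan window and the 
"utf-8" default are all assumptions made for the example, and the real prescan 
algorithm in the HTML standard handles many more cases.

```python
import codecs
import re

# BOMs checked first; a BOM takes precedence over any meta declaration.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

# Heavily simplified stand-in for the prescan: find "charset=..." inside a
# <meta ...> tag. This covers both <meta charset="x"> and the older
# <meta http-equiv=... content="text/html; charset=x"> form.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_:.-]+)',
    re.IGNORECASE,
)


def sniff_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return a lowercase encoding label for raw HTML bytes (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Only look at the start of the document (window size is an assumption).
    match = _META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default
```

With this sketch, a document starting with a UTF-8 BOM is reported as "utf-8" 
even if a meta element says otherwise, and a document without a BOM falls back 
to whatever its meta element declares.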

It's worth noting that most HTML documents appear to be UTF-8 nowadays 
(https://en.wikipedia.org/wiki/UTF-8#/media/File:UTF-8_takes_over.png). So 
there's a performance trick we use for UTF-8 documents: when loading a UTF-8 
document, we don't decode into Unicode code points and back into UTF-8, we only 
perform validation. When an invalid sequence is detected, the Unicode 
replacement character is output instead. This is equivalent to decoding and 
re-encoding, but is much faster.
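
A small Python sketch of that equivalence, for illustration only. The helper 
name load_utf8 is made up, and Python has no validate-only primitive, so a 
strict decode whose result is thrown away stands in for the validation pass. 
It shows the behaviour, not the speed-up, and it is not how Lexbor implements 
it.

```python
def load_utf8(data: bytes) -> bytes:
    """Return data as valid UTF-8, replacing invalid sequences with U+FFFD."""
    try:
        # Validation only: the decoded string is discarded, and the original
        # bytes are returned untouched when they are already valid UTF-8.
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        # Slow path for invalid input: decode with replacement characters,
        # then re-encode to UTF-8.
        return data.decode("utf-8", errors="replace").encode("utf-8")


def decode_and_reencode(data: bytes) -> bytes:
    """Reference behaviour: always decode to code points and re-encode."""
    return data.decode("utf-8", errors="replace").encode("utf-8")
```

For both valid and invalid input the two functions return the same bytes; the 
first one simply avoids the round trip when the input is already valid.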

Unicode character encoding data is stored within the Lexbor library.

> 
> Cheers,
> Dennis Snell
> 

Kind regards
Niels

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php
