On Tue, May 23, 2017 at 5:01 PM, Daniel Fath <daniel.fa...@gmail.com> wrote:
> So, if I understand this correctly - We'll first need to land this component
> in Firefox, right? And if it proves itself fine, then formalize it.

No, both the implementation and the spec would have to be pretty solid
before stuff could go into Firefox. But, as noted, DTDs are a blocker
(if Firefox is to use the same XML parser for both XUL and the Web,
which makes sense in terms of binary size even if it's rather sad for
XUL to constrain the Web side).

>> I was thinking of having resolutions for the issues that are currently
>> warnings in red and multi-vendor buy-in. (Previously, Tab from Google
>> was interested in making SVG parsing non-Draconian, but I have no idea
>> how reflective of wider buy-in that remark was.)
>
> You also mentioned warnings in red and multi-vendor buy-in. What does that
> entail?

Looks like at this time, even Mozilla-internal buy-in is lacking. :-/

On Tue, May 23, 2017 at 9:23 PM, Eric Rahm <er...@mozilla.com> wrote:
> I was hoping to write a more thorough blog post about this proposal (I have
> some notes in a gist [1]), but for now I've added comments inline. The main
> takeaway here is that I want to do a bare-bones replacement of just the
> parts of expat we currently use. It needs to support DTD entities, have a
> streaming interface, and support XML 1.0 (fourth edition). That's it: no new
> features, no rewrite of our entire XML stack.

OK.

> Our current interface is UTF-16, so that's my target for now. I think
> whatever cache-friendliness gains there might be would be lost converting
> from UTF-16 -> UTF-8 -> UTF-16.

I hope this can be reconsidered, because the assumption that it would
have to be UTF-16 -> UTF-8 -> UTF-16 isn't accurate.

encoding_rs (https://bugzilla.mozilla.org/show_bug.cgi?id=encoding_rs)
adds the capability to decode directly to UTF-8, without pivoting
through UTF-16. When the input is UTF-8 (as is the case with our
chrome XML and with most on-the-Web XML), the streaming mode amounts
to fast UTF-8 validation (no math to compute scalar values, with SIMD
acceleration for ASCII runs) plus memcpy, except for the few bytes
representing code points split across buffer boundaries. (In the
non-streaming case, it's validation and a borrow when called from
Rust, and validation and an nsStringBuffer refcount increment when
called from C++.)
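
To make that concrete, here's a rough sketch of what the streaming
decode-to-UTF-8 loop looks like from Rust (the buffer size and
collecting into a String are illustrative only; a real parser would
hand the decoded bytes to the tokenizer instead):

use encoding_rs::{CoderResult, UTF_8};

// Decode a sequence of network chunks directly to UTF-8, the way a
// streaming parser would, without pivoting through UTF-16.
fn decode_chunks(chunks: &[&[u8]]) -> String {
    // UTF_8 stands in for whatever encoding was determined from the
    // BOM / XML declaration.
    let mut decoder = UTF_8.new_decoder();
    let mut out = String::new();
    let mut buf = [0u8; 4096];
    for (i, chunk) in chunks.iter().enumerate() {
        let last = i + 1 == chunks.len();
        let mut src: &[u8] = chunk;
        loop {
            let (result, read, written, _had_errors) =
                decoder.decode_to_utf8(src, &mut buf, last);
            // buf[..written] is guaranteed-valid UTF-8; for UTF-8
            // input this is essentially validation plus memcpy.
            out.push_str(std::str::from_utf8(&buf[..written]).unwrap());
            src = &src[read..];
            match result {
                CoderResult::InputEmpty => break,    // chunk consumed
                CoderResult::OutputFull => continue, // drain more output
            }
        }
    }
    out
}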

On the other side of the parser, it's true that our DOM API takes
UTF-16, but if all the code points in a text node are U+00FF or
under, the text gets stored with the leading zero bytes omitted. It
would be fairly easy to add a hole in the abstraction that lets a
UTF-8-oriented parser set the compressed form directly, without
expanding to UTF-16 and immediately compressing back, when the parser
knows that a text node is all ASCII.
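
As a hypothetical illustration of what I mean (not the actual Gecko
data structure or its API, just the shape of the idea):

// A text node keeps its characters either as 8-bit units (when every
// code point is U+00FF or below, i.e. the leading zero byte of each
// UTF-16 code unit is dropped) or as full UTF-16.
enum TextStorage {
    OneByte(Vec<u8>),  // code points U+0000..=U+00FF
    TwoByte(Vec<u16>), // everything else
}

impl TextStorage {
    // The "hole in the abstraction": when the parser already knows a
    // UTF-8 run is pure ASCII, it can become the compressed form
    // directly, with no round trip through UTF-16.
    fn from_ascii_utf8(bytes: &[u8]) -> TextStorage {
        debug_assert!(bytes.is_ascii());
        TextStorage::OneByte(bytes.to_vec())
    }

    // Callers that want UTF-16 widen on demand.
    fn to_utf16(&self) -> Vec<u16> {
        match self {
            TextStorage::OneByte(b) => b.iter().map(|&u| u16::from(u)).collect(),
            TextStorage::TwoByte(u) => u.clone(),
        }
    }
}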

For element and attribute names, we already support finding atoms by
UTF-8 representation and in most cases element and attribute names are
ASCII with static atoms already existing for them.
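
Roughly (hypothetical types and names, not our actual atom table):

use std::collections::HashMap;

// Interned names can be looked up by their UTF-8 bytes, so a UTF-8
// parser never needs to widen "svg", "path", "width", etc. to UTF-16
// just to atomize them; the common ASCII names already have static
// atoms.
struct AtomTable {
    by_utf8: HashMap<Vec<u8>, u32>, // UTF-8 name -> atom id
}

impl AtomTable {
    fn atom_for_utf8(&self, name: &[u8]) -> Option<u32> {
        self.by_utf8.get(name).copied()
    }
}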

It seems to me that attribute values would be the only case where a
conversion from UTF-8 to UTF-16 would be needed all the time, and that
conversion can be fast for ASCII, which is what attribute values
mostly are.
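
And the ASCII fast path for that conversion is essentially a
byte-widening copy (naive sketch; real code would use an accelerated
converter):

// Convert an attribute value from UTF-8 to UTF-16, taking the cheap
// path when the value is pure ASCII (the common case).
fn attr_value_to_utf16(value: &str) -> Vec<u16> {
    if value.is_ascii() {
        // Each ASCII byte widens to one UTF-16 code unit.
        value.bytes().map(|b| u16::from(b)).collect()
    } else {
        // General conversion for the rare non-ASCII value.
        value.encode_utf16().collect()
    }
}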

Furthermore, the main Web XML case is SVG, which has relatively little
natural-language text, so it's almost entirely ASCII. Looking at the
ratio of markup to natural-language text in XUL, it seems fair to
guess that parsing XUL as UTF-8 would be a cache-friendliness win, too.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
