On Fri, Dec 4, 2020, at 19:00, Florian Schmaus wrote:

> Often you don't get raw bytes from your XML parser, but an instance of
> your programming language's native String type. But often your
> programming language provides an API to encode that String to UTF-8
> encoded bytes, which *should* match exactly the bytes on the wire.
That would also be expensive to do every time, and I'd be willing to bet the XML parser *also* gives you the ability to get bytes. Otherwise, what would it do with XML documents that don't use the same encoding as your language? (Again, I know we always use UTF-8, but an XML parser won't know that and may have to deal with other encodings.) Would it always implicitly convert every single thing? That seems like it could make for a very slow XML parser if there's no fallback that lets me say "just give me the raw bytes".

> My problem with your proposal is that it uses bytes. I don't get why
> you want to use bytes here.

Naturally. Likewise, my problem with your proposal is that it uses code points, and I don't get why you'd want to use them here :)

> You most certainly will obtain from your XML parser a type that can
> be converted to a sequence of Unicode code points.

Right, and that type is probably backed by UTF-8 encoded bytes; converting them all to a sequence of Unicode code points is the more expensive option. If I have bytes to begin with, I only have to check that the values at the start and end of the range are valid UTF-8 (one of the nice properties of UTF-8 is that you can tell whether you're at the start of a character without parsing the whole string), instead of converting everything up to the end. Then I can ignore all the bytes in the middle and deal with them later, outside of the hot path, if and when I convert the data to a string for display.

> Hence I think your proposal should use code points instead. And then,
> if I am not mistaken, your proposal matches my proposal for
> opportunistic interoperability as fallback.

You may be right that it's the same as far as fallback goes. I suspect that more things will have a UTF-8-to-whatever-they-are conversion than a UTF-32-to-whatever-they-are conversion, but to be fair I have no proof of that.
Out of curiosity, can you provide an example of an XML decoder that can *only* give you an instance of a UTF-32 string (or whatever the language/OS uses)? I can give plenty of examples (the Go one, for starters) where you only get bytes out and it's up to you to figure out what to do with them. I *could* convert those to a UTF-32 slice, but that would be unnecessary and expensive in a language designed for performance, whereas a language that does implicit conversion to its own string type is already doing implicit work and probably isn't optimizing for the kind of fast-path performance I'd like to get.

I think I should simplify my argument to: most things use UTF-8, or at least can convert from UTF-8, so we should too. Using code points is effectively using UTF-32, which most things [citation needed] don't use by default.

—Sam

_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________