On Friday, 4 December 2020 21:33:38 CET Sam Whited wrote:
> On Fri, Dec 4, 2020, at 20:23, Florian Schmaus wrote:
> > I begin to feel that a lot of your rationale is based on the idea that
> > you always (/often?) have access to the raw UTF-8 bytes as they
> > appeared on the wire.
>
> Yes, most of it is.
>
> > While is is probably true for languages where the String type's native
> > encoding is also UTF-8. It is usually not true for others. For
> > example, widely used XML parser in Java will return Java's String
> > type, which is UTF-16 (or ISO-8859-1 [1]) based.
>
> Yes, this is fair, I was thinking you could probably always get the raw
> bytes, but it does look like a lot of these *only* do DOM based parsing
> and don't keep the original representation.

This has nothing to do with DOM vs. anything else. SAX can also give you
the data in the form described by the XML model (code points).

So there appear to be two sides, and arguing from the point of view of
programming languages will always give us those who get the raw
representation of the data on the wire (C-ish things) and those who get
the high-level representation.

Thus, I propose that we stick with what the standards offer. XMPP is
based on XML in that all data exchanged is somehow wrapped in XML. XML
specifies that all character data (text) is a sequence of Unicode code
points. The encoding on the wire is irrelevant once the XML has been
decoded; on the *abstract* layer, XML provides sequences of code points,
nothing else.

Some libraries always convert to UTF-8 (libxml2), some bindings always
offer some kind of Unicode code points (e.g. Python, which
opportunistically chooses ASCII/UCS-2/UCS-4 depending on the data), and
some bindings may even expose the raw bytes and let the user deal with
them (I think there was/is a zero-copy implementation which mostly
consisted of strategically replacing XML metacharacters with NUL bytes
in the incoming data).

But all implementations which want to be XMPP and XML 1.0 compliant need
to have some way to convert to, or offer access to, code points, as
that’s the XML data model.

Let’s build on that. Easy choice. Much easier than writing 20 emails on
this topic, and that just in this thread.

> > However, given that there is a wide variety here, I am not sure if it
> > is worth to take any of that into consideration.
>
> Yes, fair enough.
>
> > Instead, my rationale is based on the idea that you always have
> > access to the Unicode code points of the textual content obtained
> > from the XML.
>
> I do not have that access without converting from UTF-8 to code points
> in the hot-path where it would be inappropriate.
> It's effectively the
> same thing: I don't want to convert from bytes to code points, you don't
> want to convert from codepoints to bytes. Some languages will have to do
> the conversion either way, so it seems worth using the thing that allows
> for the most flexibility with the least amount of work in eg. IoT
> devices using C that are trying to optimize for performance where
> passing along the bytes as received on the wire (possibly with some
> validation that the range is accurate) is acceptable.

Note that you do not have to decode UTF-8 (which can be between O(n) and
O(n^2) depending on the implementation and circumstances) to count code
points; you can certainly do the counting in O(n) (which is the same as
strlen() in C).

It would be similarly easy to write algorithms which do efficient,
batched, code-point-indexed operations on UTF-8 strings in C (such as
splitting UTF-8 byte ranges based on start/end information), if you
really wanted to do such things in C.

However, I also think that the IoT use case is a bit of a straw man,
given that IoT devices would rarely have to deal with markup or other
rich human-facing formats which require decoding such code point
references.

Thus ... I don’t buy this argument. Devices which render markup or
references would have to deal with complexity way beyond this. And
they’ll have to do the decoding anyway to do any kind of text rendering.

kind regards,
Jonas
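
A minimal sketch of the counting argument made above, not taken from any existing library: in UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so code points can be counted, and a code point index mapped to a byte offset, in a single O(n) pass without decoding anything. The function names utf8_count_codepoints and utf8_offset_of_codepoint are hypothetical, and the input is assumed to be valid, NUL-terminated UTF-8 (validation would be a separate step).

```c
#include <assert.h>
#include <stddef.h>

/* Count the code points in a NUL-terminated string.
 * Assumption: the input is already known to be valid UTF-8.
 * Every byte that is NOT a continuation byte (10xxxxxx) starts a new
 * code point, so counting those bytes is enough; nothing is decoded. */
static size_t utf8_count_codepoints(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p) {
        if ((*p & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

/* Return the byte offset at which code point number `index` (0-based)
 * starts. If the string has `index` or fewer code points, the byte
 * length of the string is returned instead.
 * Same assumption as above: valid, NUL-terminated UTF-8. */
static size_t utf8_offset_of_codepoint(const char *s, size_t index)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t offset = 0;
    size_t seen = 0; /* number of code point starts passed so far */
    while (p[offset] != 0) {
        if ((p[offset] & 0xC0) != 0x80) {
            if (seen == index) {
                return offset;
            }
            ++seen;
        }
        ++offset;
    }
    return offset;
}
```

Calling utf8_offset_of_codepoint() once for a start index and once for an end index yields a byte range that can be sliced directly out of the buffer as received on the wire, which is the kind of code-point-indexed splitting mentioned above. Both functions touch each byte at most once, the same cost as strlen().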
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________