On 12/4/20 8:25 PM, Sam Whited wrote:
> On Fri, Dec 4, 2020, at 19:00, Florian Schmaus wrote:
>> My problem with your proposal is that it uses bytes. I don't get why
>> you want to use bytes here.
>
> Naturally. Likewise my problem with your proposal is that it uses code
> points and I don't get why you'd want to use them here :)

I begin to feel that a lot of your rationale is based on the idea that you always (/often?) have access to the raw UTF-8 bytes as they appeared on the wire.

While this is probably true for languages where the String type's native encoding is also UTF-8, it is usually not true for others. For example, widely used XML parsers in Java return Java's String type, which is UTF-16 (or ISO-8859-1 [1]) based. Then there is Python 3, where the str type is a sequence of Unicode characters (code points). Of course, it would be possible to design and implement XML parsers in Java and Python that return strings as the raw bytes that appeared in the parsed XML document/stream.
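
To make the difference between these units concrete, here is a minimal Python sketch. The helper name length_in_units and the sample string are made up for illustration; they are not part of any proposal. It only shows that the same text has a different "length" depending on whether you count UTF-8 bytes, UTF-16 code units, or code points.

```python
def length_in_units(text: str) -> dict:
    """Return the length of `text` in three different units.

    Hypothetical helper, for illustration only.
    """
    return {
        # What a UTF-8 based implementation sees on the wire.
        "utf8_bytes": len(text.encode("utf-8")),
        # What a UTF-16 based string type (e.g. Java's String) would count.
        # Each UTF-16 code unit is two bytes; "utf-16-le" avoids the BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Python 3's str is a sequence of code points, so len() counts those.
        "code_points": len(text),
    }


# "naïve" followed by a space and an emoji outside the BMP (U+1F600).
sample = "na\u00efve \U0001F600"
print(length_in_units(sample))
```

For pure ASCII text all three numbers are equal; they diverge as soon as non-ASCII characters appear.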

However, given that there is a wide variety here, I am not sure it is worth taking any of that into consideration.

Instead, my rationale is based on the idea that you always have access to the Unicode code points of the textual content obtained from the XML. And I am in favor of code points because they allow us to aim for the extended grapheme cluster algorithm, while also allowing for the "simply count code points" fallback.
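
As a rough illustration of the "simply count code points" fallback, consider the Python sketch below. Both function names are hypothetical. The second function is NOT the extended grapheme cluster algorithm from UAX #29; it is a crude stand-in that merely skips combining marks, included only to show where code point counts and user-perceived characters can diverge.

```python
import unicodedata


def count_code_points(text: str) -> int:
    """The "simply count code points" fallback: len() on a Python 3 str."""
    return len(text)


def count_ignoring_combining_marks(text: str) -> int:
    """Crude approximation of user-perceived characters.

    Assumption for illustration only: this is NOT UAX #29 extended
    grapheme cluster segmentation. It just skips code points with a
    non-zero canonical combining class (e.g. combining accents).
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one perceived character.
decomposed = "e\u0301"
print(count_code_points(decomposed))               # 2
print(count_ignoring_combining_marks(decomposed))  # 1

# For plain ASCII text both counts agree.
print(count_code_points("hello"), count_ignoring_combining_marks("hello"))
```

A real implementation aiming for grapheme clusters would need a library that implements the Unicode text segmentation rules.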

Note that both methods, counting grapheme clusters vs. counting code points, would, unless I missed a grapheme cluster, yield the same result for this e-mail.

- Florian


1: Please ignore this. I have only mentioned it for completeness. If you are curious, look up "JEP 254: Compact Strings".


_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________