Hi,

On 09.12.20 08:59, Florian Schmaus wrote:
> But the recipient would be able to apply the same rules regarding
> localization as the sender when counting grapheme clusters.

Which rules? Unicode does not provide a locale-specific grapheme clustering algorithm. TR29 only mentions that such algorithms exist and that it provides just a "default" algorithm that can be extended with locale-specific rules. AFAIK there is no standard that properly defines grapheme clustering other than the TR29 algorithm, which explicitly states that it does not produce proper locale-specific grapheme clusters.

The only thing we can do is say "do what TR29 says" (it actually gives two options, but let's just stick with extended grapheme clusters). However, TR29 itself does not make any statements regarding its stability, and Unicode updates in recent years did change TR29 behavior even for existing codepoints. Thus, if we rely on the TR29 algorithm, we need to specify a version of it, which in general is a bad idea.

> I also suggest that the receiving side is considered. For example:
> "Entities that receive character counted text should normalize the
> counted text to Unicode Normalization Form C (NFC) [1] form prior
> evaluating the character indexes."

As I mentioned earlier, normalizing changes the codepoints and thus (at the XML layer) changes the transferred content. In my tests, I haven't seen any current server implementation doing that. In the worst case, normalizing can make a message unreadable to the receiving client that would otherwise have been readable (if the server has a newer Unicode version than both clients' fonts). So instead of adding client-side behavior to handle servers making modifications, I'd rather codify that servers SHOULD NOT modify the codepoints in <body>. Where we put this rule is another question.

In my draft I specifically had the rule that if an entity applies normalization, it has to update the indices if needed. This also applies to receiving entities, which is incompatible with what you wrote (or at least I understand that you want to normalize without updating indices).

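To illustrate the index-update rule (this is not text from the draft), here is a minimal Python sketch of how an entity could remap a codepoint offset after applying NFC. The helper name `remap_offset` is made up for this example, and it assumes the offset does not fall inside a combining sequence:

```python
import unicodedata


def remap_offset(text: str, offset: int) -> int:
    """Map a codepoint offset in `text` to the matching offset in NFC(text).

    Simplification (an assumption of this sketch): `offset` must not
    split a combining sequence, e.g. point between "e" and U+0301.
    """
    # The NFC length of the prefix is the number of codepoints that
    # precede the offset once the whole text has been normalized.
    return len(unicodedata.normalize("NFC", text[:offset]))


# "e" followed by U+0301 COMBINING ACUTE ACCENT composes to "é" under NFC,
# so every index after it shifts down by one.
original = "Cafe\u0301 *bold*"
start, end = 6, 12  # codepoint range of "*bold*" in the original text

normalized = unicodedata.normalize("NFC", original)
new_start = remap_offset(original, start)
new_end = remap_offset(original, end)

print(original[start:end])            # *bold*
print(normalized[start:end])          # bold*   (stale indices are off by one)
print(normalized[new_start:new_end])  # *bold*
```

Normalizing every prefix is quadratic in the worst case, which is fine for chat-sized messages; the point is only that an entity that normalizes cannot leave the indices untouched.
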
Here is the rationale behind that: Normalization as per TR15 is considered stable, which means that as long as you only use codepoints that are defined in the Unicode version your code uses, any future Unicode/TR15 version will still consider the string normalized. In other words, to ensure your client only sends normalized strings (which you would need to, so that any other entity can apply normalization without changing indices), you'd have to restrict your client to only send codepoints that are defined in the Unicode version it supports.

However, in practice, users have been sending codepoints that are not part of the Unicode version implemented in their clients. This is because you can practically use new emojis (and their codepoints) as soon as they appear in popular fonts. To give an example: to support the latest emojis in Android apps, you can use the "EmojiCompat" support library (which includes a font with all emojis of the latest version) and thereby become able to display them. However, the Unicode version used for all text processing remains the one implemented by the ICU4J version shipped with the operating system. About 60% of Android devices currently in use run Android 9 or earlier and thus implement Unicode 10.0 (released mid-2017) or earlier. Thus about 60% of Android devices would not be able to correctly normalize messages that include the 🦠 microbe emoji.

So, in practice, sending clients cannot guarantee to send normalized strings without severely harming the user experience by rejecting new codepoints. This also means that receiving clients cannot rely on receiving normalized messages, or messages whose indices refer to the normalized form.

Marvin