On 12/7/20 11:34 PM, Marvin W wrote:
> On 07.12.20 19:34, Florian Schmaus wrote:
We do have xml:lang, don't we?

Unfortunately, it doesn't help in all cases. It's perfectly fine to write
a message with xml:lang="en":

"chlapec" is "boy" in slowak

This is 27 grapheme clusters, but I guess most western people would
count it as 28.

But the recipient would be able to apply the same rules regarding localization as the sender when counting grapheme clusters.
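
To make the 27-versus-28 count concrete, here is a minimal sketch in Python. The helper `count_with_ch_digraph` is hypothetical and only mimics the one Slovak rule mentioned above ("ch" counts as a single character). A real implementation would need a locale-tailored grapheme cluster segmenter, which the Python standard library does not provide.

```python
# Illustrative only: count the example message in two different ways.
message = '"chlapec" is "boy" in slowak'

# Number of Unicode code points, which is what len() reports for a str.
code_points = len(message)


def count_with_ch_digraph(text: str) -> int:
    """Toy count that treats "ch" as one user-perceived character.

    Hypothetical helper: it only models the Slovak "ch" digraph and
    ignores every other grapheme cluster rule.
    """
    count = 0
    i = 0
    while i < len(text):
        # Consume "ch" (in any case) as a single unit, anything else as one.
        step = 2 if text[i:i + 2].lower() == "ch" else 1
        i += step
        count += 1
    return count


print(code_points)                     # 28
print(count_with_ch_digraph(message))  # 27
```

Both counts are defensible, which is why sender and recipient have to agree on the counting rules.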


Let us ignore grapheme clusters for a moment and focus on XEP-0426:
Have you considered Unicode normalization? Especially when a text
that was originally in decomposed form is normalized to composed
form. This would corrupt the code point indexes.

[..]

I think that due to this, XEP-0426 should specify that counting
happens with the text in NFC form. Or am I missing something?
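
A small sketch of the problem, assuming positions are counted in code points. The sample text and the variable names are made up for illustration and are not taken from XEP-0426.

```python
import unicodedata

composed = "caf\u00e9 au lait"  # NFC: "é" is a single code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by U+0301

# The sender counts on the decomposed text and marks the word "au".
begin = decomposed.index("au")
end = begin + len("au")

print(len(composed), len(decomposed))  # 12 13
print(repr(decomposed[begin:end]))     # 'au'

# The same offsets applied to the NFC form are shifted by one code point.
print(repr(composed[begin:end]))       # 'u '
```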

I could imagine going for something like:

Yes, that definitely goes in the right direction.


Receiving or intermediary entities SHOULD NOT apply Unicode
normalization to the text referenced from character counting.

I am not sure that you can (or that we should) put normative text that applies to intermediate hops into XEP-0426. The XEP could/should limit itself to describing normative clauses for the endpoints exchanging character counting data.


If
entities apply Unicode normalization, they SHOULD update all
positions, indices and lengths derived from character counting if
required.

As above. I think this would need at least a discoverable disco#info feature. But even then, I doubt that this is useful in a normative form. However, it probably cannot hurt to have XEP-0426 spell this out as a recommendation in an informative way.
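
For illustration, updating a position after normalization could look roughly like the sketch below. The function `remap_index_to_nfc` is a hypothetical helper, not something XEP-0426 defines. It assumes code point indices and assumes the index does not split a base character from its combining marks; if it does, the result is not meaningful.

```python
import unicodedata


def remap_index_to_nfc(text: str, index: int) -> int:
    """Map a code point index in `text` to the matching index in NFC(text).

    Hypothetical helper. Assumes `index` falls on a normalization
    boundary, so that normalizing the prefix on its own yields the
    prefix of the normalized text.
    """
    return len(unicodedata.normalize("NFC", text[:index]))


original = "cafe\u0301 au lait"  # decomposed: "e" followed by U+0301
old_begin, old_end = 6, 8        # marks "au" in the decomposed text

normalized = unicodedata.normalize("NFC", original)
new_begin = remap_index_to_nfc(original, old_begin)
new_end = remap_index_to_nfc(original, old_end)

print(new_begin, new_end)                   # 5 7
print(repr(normalized[new_begin:new_end]))  # 'au'
```

An entity that rewrites the text would have to do this for every position, index and length that refers to it.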


It is RECOMMENDED that entities creating the original
stanzas use NFC form.

Now that is the part I really like and which I believe to be missing from XEP-0426. +1

I also suggest that the receiving side is considered. For example: "Entities that receive character counted text should normalize the counted text to Unicode Normalization Form C (NFC) [1] prior to evaluating the character indexes."

1: https://unicode.org/reports/tr15/
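
On the receiving side this could be as simple as the following sketch. The function name and the begin/end parameters are assumptions made for illustration. It presumes the sender counted code points on NFC text, as recommended above.

```python
import unicodedata


def extract_counted_range(text: str, begin: int, end: int) -> str:
    """Return code points [begin, end) of `text` after NFC normalization.

    Hypothetical helper: assumes the sender counted code points on
    the NFC form of the same text.
    """
    normalized = unicodedata.normalize("NFC", text)
    if not 0 <= begin <= end <= len(normalized):
        raise ValueError("range lies outside the normalized text")
    return normalized[begin:end]


# The text arrived in decomposed form, but the indices refer to NFC.
received = "cafe\u0301 au lait"
print(repr(extract_counted_range(received, 5, 7)))  # 'au'
```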

- Florian

_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________
