On 10/25/19 3:15 PM, Sam Whited wrote:
> On Thu, Oct 24, 2019, at 18:32, Marvin W wrote:
> XMPP uses UTF-8, and there's almost no reason to use anything but UTF-8.

I do agree that this is true inside XMPP, but the data being transported inside XMPP might be transcoded for a non-XMPP transport (examples: bridges to other networks, clients that don't do XMPP on their c2s connections), and in those use cases different encodings might occur. We shouldn't focus on non-UTF-8 encodings, but considering them doesn't hurt either.
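
To illustrate the transcoding concern, here is a minimal Python sketch. The sample text and the byte_offset helper are made up for illustration; the point is only that a codepoint offset stays the same across encodings while a byte offset does not.

```python
# Sample text: "Grüße " followed by the EU flag (two regional indicator codepoints).
text = "Gr\u00fc\u00dfe \U0001F1EA\U0001F1FA"
flag = "\U0001F1EA\U0001F1FA"

# Python strings are indexed by codepoint, so this is a codepoint offset.
cp_start = text.index(flag)

def byte_offset(s: str, cp_index: int, encoding: str) -> int:
    """Hypothetical helper: byte offset of codepoint index cp_index in the given encoding."""
    return len(s[:cp_index].encode(encoding))

print(cp_start)                                 # 6
print(byte_offset(text, cp_start, "utf-8"))     # 8
print(byte_offset(text, cp_start, "utf-16-le")) # 12
```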
> This problem exists with codepoints too, though to a lesser extent and
> it may be less clear how it should be handled in all cases. For example,
> in the middle of a multi-codepoint emoji or country flag.

Yes and no. Multi-codepoint emojis are still valid characters when split, whereas multi-byte codepoints cannot be split. There is nothing wrong with displaying the flag 🇪🇺 as 🇪​🇺 *, so your implementation is always capable of strictly following any markup applied on a codepoint basis, even if the markup border falls inside a multi-codepoint emoji.
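
A small Python sketch of that difference, using the EU flag (the variable names are illustrative only): splitting between the two codepoints yields two valid strings, while splitting the UTF-8 bytes inside a codepoint yields something that cannot be decoded.

```python
# The EU flag is two codepoints: REGIONAL INDICATOR SYMBOL LETTER E and U.
flag = "\U0001F1EA\U0001F1FA"
assert len(flag) == 2

# Splitting on a codepoint boundary: both halves are valid text on their own.
left, right = flag[:1], flag[1:]
left.encode("utf-8")
right.encode("utf-8")

# Splitting inside a codepoint: each indicator is 4 bytes in UTF-8,
# so the first 2 bytes are an incomplete sequence.
raw = flag.encode("utf-8")
try:
    raw[:2].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(len(raw), split_ok)  # 8 False
```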

> There's also the minor problem of having to decode all the bytes up to
> the start position at the application layer if we have to count
> codepoints.

Some programming languages handle strings as Unicode codepoints instead of bytes. I agree that this would be an issue for non-messaging content (e.g. large files), but I don't think that is what we are talking about. For messaging content, it's no issue that the client has to decode all the bytes; it will be required to do so anyway for display.
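
As a rough sketch of why the decoding cost is not extra work for a message body: in a language with codepoint-indexed strings such as Python, the one decode needed for display already gives codepoint indexing. The apply_span function and its arguments are hypothetical, not part of any proposal.

```python
def apply_span(body_bytes: bytes, start: int, end: int):
    """Hypothetical helper: split a message body around a markup span.

    start and end are codepoint offsets; end is exclusive.
    """
    # This decode is needed for display anyway.
    text = body_bytes.decode("utf-8")
    # After decoding, slicing by codepoint needs no further scanning logic.
    return text[:start], text[start:end], text[end:]

before, marked, after = apply_span("I \u2665 XMPP".encode("utf-8"), 2, 3)
print(repr(before), repr(marked), repr(after))
```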

> With bytes you only have two checks: is the start and the
> end marker on a byte boundary? If so the string in the middle can be
> assumed to be valid.

Assuming you meant codepoint boundary instead of byte boundary, I agree that this would also be an option, as long as we make sure people actually do these checks. I personally prefer codepoints, but both are valid and sane options; as long as we don't go with grapheme clusters or anything like that, we are fine IMO.
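
For the byte-offset variant, the check could look roughly like the following Python sketch. It assumes the body as a whole is already valid UTF-8; the function name is made up. In UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so an offset is on a codepoint boundary exactly when it does not point at such a byte.

```python
def is_codepoint_boundary(data: bytes, offset: int) -> bool:
    """Hypothetical check: does a byte offset fall between two codepoints?

    Assumes data is valid UTF-8 as a whole.
    """
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        # The end of the data is always a boundary.
        return True
    # Continuation bytes look like 0b10xxxxxx; anything else starts a codepoint.
    return (data[offset] & 0xC0) != 0x80

data = "a\u00e9b".encode("utf-8")  # b'a\xc3\xa9b', 4 bytes
print([is_codepoint_boundary(data, i) for i in range(len(data) + 1)])
# [True, True, False, True, True]
```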

Marvin

--

* I put a zero-width space in there to ensure your mail client is not going to merge the two characters.
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________
