Re: [Standards] Proposed XMPP Extension: Character counting in message bodies
XML is a sequence of characters (not bytes.) References mark a portion of displayed text which is rendered as a sequence of characters (not bytes.) So it makes perfect sense to define references in terms of bytes. ___ Standards mailing list Info: https://mail.jabber.org/mailman/listinfo/standards Unsubscribe: standards-unsubscr...@xmpp.org ___
Re: [Standards] Proposed XMPP Extension: Character counting in message bodies
I believe this is a mischaracterization of my argument. My argument is "everything will have a way to get at the underlying bytes, not everything will have them pre-converted into code points". Also "this gives us the option to do certain optimizations on systems that support them, but using code points doesn't, so we should do the thing that is the most flexible".

—Sam

On Wed, Dec 9, 2020, at 19:09, Tedd Sterr wrote:
> Regardless, your argument is still "bytes is more convenient for me,
> so everyone else should do what's best for me." I don't think that's a
> good argument.
Re: [Standards] Proposed XMPP Extension: Character counting in message bodies
>> The decoding _should_ be done upfront - that's how you get a valid XML
>> document.
> I don't think this is true. XML is defined as UTF-8 (in this case),
> which is a collection of bytes. They don't have to be separated out and
> transformed into some higher representation of code points. Just because
> Python et al. convert things into UTF-32 strings first doesn't mean
> everything has to.
>
> Regardless of what language you're using it's trivial to deal with this
> as a UTF-8 byte stream, it is not always trivial to handle this as a
> UTF-32 integer stream as the example shows.

XML is defined as a sequence of characters; it doesn't specify how those characters must be encoded (though it does require support for both UTF-8 and UTF-16). UTF-7/8/16/32 are encoding schemes, not character representations - people do make the mistake of conflating the two things, but that doesn't mean they are the same. Unicode doesn't specify the size of characters - they don't have a specific bit-width, they are as large as required; the encoding scheme is then a method to transform characters into a sequence of bytes.

It shouldn't matter what encoding scheme is used - UTF-8, UTF-16, ISO-8859-9, ISO-2022-JP, Shift_JIS, EBCDIC are all possibilities - because you're supposed to decode the data into characters before doing anything with it. The fact that you're able to take advantage of the foreknowledge of your data being encoded using UTF-8 is purely because XMPP happens to define it that way, not because XML is defined using any specific encoding scheme.

Basing your entire implementation around the expectation of UTF-8 allows you to take some convenient short-cuts, but much of that only works because XML markup uses ASCII-compatible characters, which conveniently have an equivalent single-byte representation when encoded as UTF-8; if it were almost any other encoding then it simply wouldn't work without some form of decoding first.
If you insist on not decoding and then run into difficulties with handling characters because you're purposely avoiding handling characters while simultaneously using XML which is defined as a sequence of characters, an appropriate response is "what did you expect?"

It's not trivial to handle everything as UTF-8 in implementations where the application receives already decoded strings (a sequence of characters, not bytes) from the XML parser. The most likely approach to dealing with that will be to re-encode the already decoded data back into UTF-8 just to deal with the offsets, which is precisely the kind of inefficient processing you're suggesting should be avoided.

And considering the whole purpose of references is for marking sequences of characters, those characters are going to be decoded at some point; you're trying to avoid decoding early, while still validating offsets, so that the decoding can be done later anyway.

Regardless, your argument is still "bytes is more convenient for me, so everyone else should do what's best for me." I don't think that's a good argument.
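To make the distinction between characters and encoding schemes concrete, here is a minimal Go sketch. The helper name `encodedSizes` and the sample text are illustrative assumptions and do not come from the thread. It counts the same text three ways: as code points, as UTF-8 bytes and as UTF-16 bytes.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodedSizes (hypothetical helper) reports, for one piece of text, the
// number of code points, the number of bytes when encoded as UTF-8, and the
// number of bytes when encoded as UTF-16.
func encodedSizes(s string) (codePoints, utf8Bytes, utf16Bytes int) {
	codePoints = utf8.RuneCountInString(s)
	utf8Bytes = len(s) // a Go string holds UTF-8 encoded bytes
	// Each UTF-16 code unit occupies two bytes on the wire.
	utf16Bytes = 2 * len(utf16.Encode([]rune(s)))
	return codePoints, utf8Bytes, utf16Bytes
}

func main() {
	// U+00EF is "i with diaeresis", U+1F9A0 is the microbe emoji.
	cp, b8, b16 := encodedSizes("na\u00efve \U0001F9A0")
	fmt.Println(cp, b8, b16) // prints "7 11 16"
}
```

The code point count stays the same whichever encoding scheme is used, while the byte counts depend on it. A byte offset is therefore only meaningful once the encoding is fixed, which XMPP does by mandating UTF-8.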
Re: [Standards] Proposed XMPP Extension: Character counting in message bodies
I don't think this is true. XML is defined as UTF-8 (in this case), which is a collection of bytes. They don't have to be separated out and transformed into some higher representation of code points. Just because Python et al. convert things into UTF-32 strings first doesn't mean everything has to.

Regardless of what language you're using it's trivial to deal with this as a UTF-8 byte stream, it is not always trivial to handle this as a UTF-32 integer stream as the example shows.

—Sam

On Wed, Dec 9, 2020, at 14:03, Tedd Sterr wrote:
> The decoding _should_ be done upfront - that's how you get a valid XML
> document.
Re: [Standards] Proposed XMPP Extension: Character counting in message bodies
For the record:

On Tuesday, 8 December 2020 23:13:08 CET Sam Whited wrote:
> I don't understand how this is part of the XML data model. Do you mean
> that only Unicode encodings are supported by XML? If so, that's fair and
> removes one of my arguments, I did not know that was the case. However,
> I still think the data on the wire should describe the other data on the
> wire, not some higher-level "decoded" representation that many XML
> libraries may not even use.

Let me dig up the references:

https://www.w3.org/TR/REC-xml/#charsets
> [Definition: A parsed entity contains text, a sequence of characters, which may represent markup or character data.]

text = sequence of characters, representing markup or character data

https://www.w3.org/TR/REC-xml/#syntax
> [Definition: All text that is not markup constitutes the character data of the document.]

Ok, so we have text which is a sequence of characters, and what isn’t markup is character data. Now what are characters in XML?

Back to: https://www.w3.org/TR/REC-xml/#charsets
> [Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of these standards cited in A.1 Normative References were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char. ]

That is the definition of a subset of the Unicode code point range:

> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

kind regards,
Jonas
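The Char production of XML 1.0 (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF) translates almost directly into code. The following Go sketch is only an illustration; the function name `isXMLChar` is an assumption and not part of any XML library.

```go
package main

import "fmt"

// isXMLChar (hypothetical helper) reports whether the code point r matches
// the Char production of XML 1.0: tab, line feed, carriage return, and every
// Unicode code point except the surrogate blocks, U+FFFE and U+FFFF.
func isXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

func main() {
	for _, r := range []rune{'\t', 'A', 0x0, 0xD800, 0xFFFE, 0x1F9A0} {
		fmt.Printf("U+%04X legal=%v\n", r, isXMLChar(r))
	}
}
```

The production is written entirely in terms of code points, with no mention of bytes or of a particular encoding, which is the point being made above.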
Re: [Standards] Proposed XMPP Extension: Character counting in message bodies
Sam, your argument appears to be "I want to handle everything as bytes without doing any string decoding, so any other option would be more effort (less efficient) for me."

XML is defined as a sequence of characters, not bytes - those characters subsequently need to be transformed into bytes for the purpose of storage/transmission, and that's defined by the encoding scheme (UTF-8 in this case).

Bytes is convenient for you, but not for everyone else using a language that does the decoding upfront. The decoding _should_ be done upfront - that's how you get a valid XML document.

If you're trying to handle XML without first decoding from UTF-8 so you can save a few clock-cycles, that's cool, but you are going to run into awkward annoyances when it comes to trying to handle such alien concepts as characters. The reason you can mostly get away with not decoding is because the lower half of ASCII is represented the same way when using UTF-8, so you can pretend the XML tags are encoded as ASCII characters and just treat any Unicode strings as opaque binary blobs - but that is only a convenient hack.

If everyone else is to go along with your convenient hack, that only means they will have to deal with their own awkward annoyances because they made the terrible decision to decode strings before handling them (as if that's what you're actually supposed to do).
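The "convenient hack" can be demonstrated with a short Go sketch. The names `nextTagStart` and `utf16BE` are hypothetical helpers introduced for this illustration. In UTF-8 a byte below 0x80 never occurs inside a multi-byte sequence, so scanning for the byte 0x3C finds a real `<`. In UTF-16 the same byte value can be half of an unrelated character.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf16"
)

// nextTagStart (hypothetical helper) returns the offset of the first byte
// with the value of '<' (0x3C), without decoding anything.
func nextTagStart(data []byte) int {
	return bytes.IndexByte(data, '<')
}

// utf16BE (hypothetical helper) encodes s as big-endian UTF-16 without a
// byte order mark.
func utf16BE(s string) []byte {
	units := utf16.Encode([]rune(s))
	out := make([]byte, 0, 2*len(units))
	for _, u := range units {
		out = append(out, byte(u>>8), byte(u))
	}
	return out
}

func main() {
	text := "\u3c41<a/>" // U+3C41 is a CJK ideograph, followed by markup

	// UTF-8: U+3C41 is the three bytes E3 B1 81, so the match at offset 3
	// is the real '<'.
	fmt.Println(nextTagStart([]byte(text))) // prints 3

	// UTF-16BE: U+3C41 is the two bytes 3C 41, so the scan reports a false
	// match at offset 0, inside the ideograph.
	fmt.Println(nextTagStart(utf16BE(text))) // prints 0
}
```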
Re: [Standards] Proposed XMPP Extension: Character counting in message bodies
To try and show why I'm pushing back on this so hard, here is an example of doing this three different ways: one assuming the references are bytes, two assuming the references are code points.

https://play.golang.org/p/kKbr2hXd56U

The third one I was forgetting I can do, and it looks quite nice (if we ignore the performance cost as people seem to want to do) but we can't do any error handling for reasons explained in the comments. If we're a client this may not matter: it's not the end of the world if we show the user a reference that starts or ends with an ugly error character box or something. If we're the server this might matter more. Either way, I think having a sane way to do error handling on bad references is a requirement.

Of course, this is Go specific but the solutions probably look similar in other C-like languages. I should also note that this is using a higher level decoding API than I am using, but it doesn't matter since the extra boilerplate required to do this at the lower level, where you get byte slices out, would look the same for the first two examples. However, it would require extra work for me to do the third example (because it would give me []byte, not a string), which makes it even less practical, and the third example isn't a convenience that exists in e.g. C, so generally it's worth just ignoring.

If I'm having to pick between the code in the first and second example, please let me pick the first.

—Sam

On Tue, Dec 8, 2020, at 22:13, Sam Whited wrote:
> The XML library I use does not give me a string or slice of code
> points, it gives me a slice of bytes because that's the level I'm
> operating at. Even at the higher level if I decode the bytes into a
> string (A Go string in this case), that is still just a slice of UTF-8
> bytes (it does not decode them, ensure they're valid, and turn them
> into a slice of code points, that is a very expensive operation that
> it avoids until you need it or explicitly do it yourself).
>
> I don't understand how this is part of the XML data model. Do you mean
> that only Unicode encodings are supported by XML? If so, that's fair
> and removes one of my arguments, I did not know that was the case.
> However, I still think the data on the wire should describe the other
> data on the wire, not some higher-level "decoded" representation that
> many XML libraries may not even use.
>
> —Sam
>
> On Tue, Dec 8, 2020, at 21:32, Jonas Schäfer wrote:
> > But all implementations which want to be XMPP and XML 1.0 compliant
> > need to have some way to convert or offer access to code points, as
> > that’s the XML data model. Let’s build on that.
> >
> > Easy choice.
> >
> > Much easier than writing 20 emails on this topic, and that just in
> > this thread.

-- 
Sam Whited
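For readers who cannot open the playground link, the following Go sketch illustrates the trade-off under discussion. It is not the code from that link. The function names, the error value and the half-open [begin, end) convention are assumptions made for this illustration, and the body is assumed to be valid UTF-8.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

var errBadRef = errors.New("reference out of range or splits a character")

// sliceByBytes treats begin and end as byte offsets into UTF-8 text.
// Validation only needs to look at the two boundary bytes.
func sliceByBytes(body string, begin, end int) (string, error) {
	if begin < 0 || end < begin || end > len(body) {
		return "", errBadRef
	}
	// A boundary is acceptable at the end of the text or on the first byte
	// of an encoded character, never on a continuation byte.
	if begin < len(body) && !utf8.RuneStart(body[begin]) {
		return "", errBadRef
	}
	if end < len(body) && !utf8.RuneStart(body[end]) {
		return "", errBadRef
	}
	return body[begin:end], nil
}

// sliceByCodePoints treats begin and end as code point offsets. The text has
// to be walked from the start to translate them into byte positions.
func sliceByCodePoints(body string, begin, end int) (string, error) {
	if begin < 0 || end < begin {
		return "", errBadRef
	}
	byteBegin, byteEnd := -1, -1
	n := 0 // code points seen so far
	for i := range body { // i is the byte index where each code point starts
		if n == begin {
			byteBegin = i
		}
		if n == end {
			byteEnd = i
		}
		n++
	}
	// An offset equal to the total count points just past the last character.
	if begin == n {
		byteBegin = len(body)
	}
	if end == n {
		byteEnd = len(body)
	}
	if byteBegin < 0 || byteEnd < 0 {
		return "", errBadRef
	}
	return body[byteBegin:byteEnd], nil
}

func main() {
	body := "h\u00e9llo \U0001F9A0!"
	fmt.Println(sliceByBytes(body, 7, 11))     // the emoji, addressed by bytes
	fmt.Println(sliceByBytes(body, 8, 11))     // error: offset 8 is mid-character
	fmt.Println(sliceByCodePoints(body, 6, 7)) // the emoji, addressed by code points
	fmt.Println(sliceByCodePoints(body, 6, 9)) // error: only 8 code points exist
}
```

Both variants can reject bad references. The byte variant can fail in one extra way (an offset inside a character) but checks it in constant time, while the code point variant cannot split a character but has to scan the text.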
Re: [Standards] Proposed XMPP Extension: Character counting in message bodies
Hi,

On 09.12.20 08:59, Florian Schmaus wrote:
> But the recipient would be able to apply the same rules regarding
> localization as the sender when counting grapheme clusters.

Which rules? Unicode does not provide a locale-specific grapheme clustering algorithm. TR29 only mentions that those exist and that it only provides a "default" algorithm that can be extended with locale-specific rules. AFAIK there is no standard that properly defines grapheme clustering other than the TR29 algorithm, which specifically declares that it does not create proper locale-specific grapheme clusters.

The only thing we can do is say "do what TR29 says" (it actually gives two options, but let's just stick with extended grapheme clusters). However, TR29 itself does not make any statements regarding its stability, and Unicode updates in the last years did change TR29 behavior even for existing codepoints. Thus if we rely on the TR29 algorithm we need to specify a version of it, which in general is a bad idea.

> I also suggest that the receiving side is considered. For example:
> "Entities that receive character counted text should normalize the
> counted text to Unicode Normalization Form C (NFC) [1] form prior
> to evaluating the character indexes."

As I mentioned earlier, normalizing is changing the codepoints and thus (in the XML layer) changing the transferred content. In my tests, I haven't seen any current server implementation doing that. Worst case, normalizing can result in messages becoming unreadable to the receiving client that otherwise would have been readable (if the server has a newer Unicode version than both clients' fonts). So instead of adding client-side behavior to handle servers doing modifications, I'd rather codify that servers SHOULD NOT modify the codepoints in . Where we put this rule is another question.

In my draft I specifically had the rule that if an entity applies normalization they have to update the indices if needed. This also applies to receiving entities, which is incompatible with what you wrote (or at least I understand that you want to normalize without updating indices). Here is the rationale behind that:

Normalization as per TR15 is considered stable, which means that as long as you only use codepoints that are defined in the Unicode version your code uses, any future Unicode/TR15 version will consider the string normalized. In other terms, this means that to ensure your client only sends normalized strings (which you would need to, so that any other entity can apply normalization without changing indices), you'd have to restrict your client to only send codepoints that are defined in the Unicode version it supports.

However, in practice, users have been sending codepoints that are not part of the Unicode specification implemented in their clients. This is because you can practically use new emojis (and their codepoints) as soon as they appear in popular fonts.

Just to give an example: to support the latest emojis in Android apps, you can use the "EmojiCompat" support library (which includes a font with all emojis of the latest version) and thereby become able to display them. However, the supported Unicode version for all text processing still remains the version implemented by the ICU4J version shipped with the operating system. About 60% of Android devices currently in use have Android 9 or earlier and thus implement Unicode 10.0 or earlier (which was released mid 2017). Thus 60% of Android devices would not be able to correctly normalize messages that include the 🦠 microbe emoji.

Thus, in practice, sending clients cannot guarantee to send normalized strings without severely harming user experience by not accepting new codepoints. This also means that receiving clients cannot rely on receiving normalized messages or messages where indices refer to normalized messages.

Marvin
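A short Go sketch shows why normalizing without updating indices corrupts references. The Go standard library has no normalization function, so the composed and decomposed forms are written out by hand here. The helper name `codePointIndex` and the sample text are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// codePointIndex (hypothetical helper) returns the code point offset of the
// first occurrence of sub in s, or -1 if sub does not occur.
func codePointIndex(s, sub string) int {
	i := strings.Index(s, sub) // byte offset
	if i < 0 {
		return -1
	}
	return utf8.RuneCountInString(s[:i])
}

func main() {
	// The same visible text in two canonically equivalent forms.
	decomposed := "cafe\u0301 menu" // 'e' followed by U+0301 COMBINING ACUTE ACCENT
	composed := "caf\u00e9 menu"    // precomposed U+00E9

	// Code point offset and byte offset of the word "menu" in each form.
	fmt.Println(codePointIndex(decomposed, "menu"), strings.Index(decomposed, "menu")) // prints "6 7"
	fmt.Println(codePointIndex(composed, "menu"), strings.Index(composed, "menu"))     // prints "5 6"
}
```

Both the code point offset and the byte offset of the word shift by one when the text is composed, so a reference computed against one form points at the wrong place in the other unless it is recalculated.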
Re: [Standards] Proposed XMPP Extension: Character counting in message bodies
On 12/7/20 11:34 PM, Marvin W wrote:
> On 07.12.20 19:34, Florian Schmaus wrote:
>> We do have xml:lang, don't we?
>
> Unfortunately, it doesn't help in all cases. It's perfectly fine to
> write a message with xml:lang="en":
>
> "chlapec" is "boy" in Slovak
>
> This is 27 grapheme clusters, but I guess most western people would
> count it as 28.

But the recipient would be able to apply the same rules regarding localization as the sender when counting grapheme clusters.

>> Let us ignore grapheme clusters for a moment and focus on XEP-0426:
>> Have you considered Unicode normalization? Especially when a text that
>> was originally in decomposed form is normalized to composed form. This
>> would corrupt the code point indexes. [..] I think that due to this,
>> XEP-0426 should specify that counting happens with the text in NFC
>> form. Or am I missing something?
>
> I could imagine going for something like:

Yes, that definitely goes in the right direction.

> Receiving or intermediary entities SHOULD NOT apply Unicode
> normalization to the text referenced from character counting.

I am not sure that you can (or that we should) put normative text that applies to intermediate hops into XEP-0426. The XEP could/should limit itself to describing normative clauses for the end-points exchanging character counting data.

> If entities apply Unicode normalization, they SHOULD update all
> positions, indices and lengths derived from character counting if
> required.

As above. I think this would need at least a discoverable disco#info feature. But even then, I doubt that this is useful in a normative form. However, it probably cannot hurt to have XEP-0426 spell this out as a recommendation in an informative way.

> It is RECOMMENDED that entities creating the original stanzas use NFC
> form.

Now that is the part I really like and which I believe to be missing from XEP-0426. +1

I also suggest that the receiving side is considered. For example: "Entities that receive character counted text should normalize the counted text to Unicode Normalization Form C (NFC) [1] form prior to evaluating the character indexes."

1: https://unicode.org/reports/tr15/

- Florian