Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Tedd Sterr
XML is a sequence of characters (not bytes.)

References mark a portion of displayed text which is rendered as a sequence of 
characters (not bytes.)

So it makes perfect sense to define references in terms of bytes.

___
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
___


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Sam Whited
I believe this is a mischaracterization of my argument. My argument is
"everything will have a way to get at the underlying bytes, not
everything will have them pre-converted into code points". Also "this
gives us the option to do certain optimizations on systems that support
them, but using code points doesn't so we should do the thing that is
the most flexible".

—Sam

On Wed, Dec 9, 2020, at 19:09, Tedd Sterr wrote:
> Regardless, your argument is still "bytes is more convenient for me,
> so everyone else should do what's best for me." I don't think that's a
> good argument.


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Tedd Sterr
>> The decoding _should_ be done upfront - that's how you get a valid XML 
>> document.

> I don't think this is true. XML is defined as UTF-8 (in this case),
> which is a collection of bytes. They don't have to be separated out and
> transformed into some higher representation of code points. Just because
> Python et al. convert things into UTF-32 strings first doesn't mean
> everything has to.
>
> Regardless of what language you're using it's trivial to deal with this
> as a UTF-8 byte stream, it is not always trivial to handle this as a
> UTF-32 integer stream as the example shows.

XML is defined as a sequence of characters; it doesn't specify how those 
characters must be encoded (though it does require support for both UTF-8 and 
UTF-16.) UTF-7/8/16/32 are encoding schemes, not character representations - 
people do make the mistake of conflating the two things, but that doesn't mean 
they are the same.

Unicode doesn't specify the size of characters - they don't have a specific 
bit-width, they are as large as required; the encoding scheme is then a method 
to transform characters into a sequence of bytes. It shouldn't matter what 
encoding scheme is used - UTF-8, UTF-16, ISO-8859-9, ISO-2022-JP, Shift_JIS, 
EBCDIC, are all possibilities - because you're supposed to decode the data into 
characters before doing anything with it.

The fact that you're able to take advantage of the foreknowledge of your data 
being encoded using UTF-8 is purely because XMPP happens to define it that way, 
not because XML is defined using any specific encoding scheme. Basing your 
entire implementation around the expectation of UTF-8 allows you to take some 
convenient short-cuts, but much of that only works because XML markup uses 
ASCII-compatible characters, which conveniently have an equivalent single-byte 
representation when encoded as UTF-8; if it were almost any other encoding then 
it simply wouldn't work without some form of decoding first. If you insist on 
not decoding and then run into difficulties with handling characters because 
you're purposely avoiding handling characters while simultaneously using XML 
which is defined as a sequence of characters, an appropriate response is "what 
did you expect?"

It's not trivial to handle everything as UTF-8 in implementations where the 
application receives already decoded strings (a sequence of characters, not 
bytes) from the XML parser. The most likely approach to dealing with that will 
be to re-encode the already decoded data back into UTF-8 just to deal with the 
offsets, which is precisely the kind of inefficient processing you're 
suggesting should be avoided. And considering the whole purpose of references 
is for marking sequences of characters, those characters are going to be 
decoded at some point; you're trying to avoid decoding early, while still 
validating offsets, so that the decoding can be done later anyway.

Regardless, your argument is still "bytes is more convenient for me, so 
everyone else should do what's best for me." I don't think that's a good 
argument.
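
A minimal sketch of the offset conversion described above, in Python, which
hands the application already decoded strings. The message body and the two
helper names are made up for illustration; they do not come from XEP-0426 or
from any XMPP library.

```python
# Hypothetical message body; "ü" and "ß" each take two bytes in UTF-8.
body = "Grüße, see https://xmpp.org"
encoded = body.encode("utf-8")

# Start of the URL counted in code points (what a decoded string exposes).
cp_begin = body.index("https")

# The same position counted in UTF-8 bytes (what is on the wire).
byte_begin = encoded.index(b"https")


def codepoint_to_byte_offset(text: str, cp_offset: int) -> int:
    """Convert a code point offset to a UTF-8 byte offset by re-encoding the prefix."""
    return len(text[:cp_offset].encode("utf-8"))


def byte_to_codepoint_offset(data: bytes, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset to a code point offset by decoding the prefix."""
    return len(data[:byte_offset].decode("utf-8"))
```

Either conversion has to walk the prefix of the text, which is the extra work
a side takes on when offsets are defined in the unit it does not natively hold.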



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Sam Whited
I don't think this is true. XML is defined as UTF-8 (in this case),
which is a collection of bytes. They don't have to be separated out and
transformed into some higher representation of code points. Just because
Python et al. convert things into UTF-32 strings first doesn't mean
everything has to.

Regardless of what language you're using it's trivial to deal with this
as a UTF-8 byte stream, it is not always trivial to handle this as a
UTF-32 integer stream as the example shows.

—Sam

On Wed, Dec 9, 2020, at 14:03, Tedd Sterr wrote:
> The decoding _should_ be done upfront - that's how you get a valid XML
> document.


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Jonas Schäfer
For the record:

On Tuesday, 8 December 2020 23:13:08 CET Sam Whited wrote:
> I don't understand how this is part of the XML data model. Do you mean
> that only Unicode encodings are supported by XML? If so, that's fair and
> removes one of my arguments, I did not know that was the case. However,
> I still think the data on the wire should describe the other data on the
> wire, not some higher-level "decoded" representation that many XML
> libraries may not even use.

Let me dig up the references:

https://www.w3.org/TR/REC-xml/#charsets

> [Definition: A parsed entity contains text, a sequence of characters, which
> may represent markup or character data.]

text = sequence of characters, representing markup or character data

https://www.w3.org/TR/REC-xml/#syntax

> [Definition: All text that is not markup constitutes the character data of
> the document.]

Ok, so we have text which is a sequence of characters, and what isn’t markup 
is character data.

Now what are characters in XML? Back to: 
https://www.w3.org/TR/REC-xml/#charsets

> [Definition: A character is an atomic unit of text as specified by ISO/IEC
> 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line
> feed, and the legal characters of Unicode and ISO/IEC 10646. The versions of
> these standards cited in A.1 Normative References were current at the time
> this document was prepared. New characters may be added to these standards by
> amendments or new editions. Consequently, XML processors MUST accept any
> character in the range specified for Char.]

That is the definition of a subset of the Unicode code point range:

> [2]   Char   ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
>                   | [#x10000-#x10FFFF]
> /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

kind regards,
Jonas
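
For illustration, the Char production quoted above can be written as a small
predicate over code points. This is only a sketch in Python; is_xml_char is a
hypothetical name, and the ranges are those of production [2] in XML 1.0.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if the code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # up to, but excluding, the surrogates
        or 0xE000 <= cp <= 0xFFFD       # excludes FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )


def all_xml_chars(text: str) -> bool:
    """Check every code point of an already decoded string."""
    return all(is_xml_char(ord(ch)) for ch in text)
```

The predicate takes code points, not bytes, which matches the point made above
that the XML data model is expressed in terms of characters.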




Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Tedd Sterr
Sam, your argument appears to be "I want to handle everything as bytes without 
doing any string decoding, so any other option would be more effort (less 
efficient) for me."

XML is defined as a sequence of characters, not bytes - those characters 
subsequently need to be transformed into bytes for the purpose of 
storage/transmission, and that's defined by the encoding scheme (UTF-8 in this 
case.) Bytes is convenient for you, but not for everyone else using a language 
that does the decoding upfront. The decoding _should_ be done upfront - that's 
how you get a valid XML document.

If you're trying to handle XML without first decoding from UTF-8 so you can 
save a few clock-cycles, that's cool, but you are going to run into awkward 
annoyances when it comes to trying to handle such alien concepts as characters. 
The reason you can mostly get away with not decoding is that 7-bit ASCII 
is represented the same way when using UTF-8, so you can pretend the 
XML tags are encoded as ASCII characters and just treat any Unicode strings as 
opaque binary blobs - but that is only a convenient hack. If everyone else is 
to go along with your convenient hack, that only means they will have to deal 
with their own awkward annoyances because they made the terrible decision to 
decode strings before handling them (as if that's what you're actually supposed 
to do.)
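
A small sketch of why the byte-level shortcut works for UTF-8 and not for other
encodings, written in Python. The stanza fragment is made up for illustration.

```python
# Hypothetical stanza fragment; \U0001F9A0 is the microbe emoji.
stanza = "<body>Grüße \U0001F9A0</body>"

utf8 = stanza.encode("utf-8")
utf16 = stanza.encode("utf-16-le")

# In UTF-8 every byte of a multi-byte sequence is >= 0x80, so a byte such as
# 0x3C ("<") can only ever be the ASCII character itself.
non_ascii_bytes = [b for ch in "üß\U0001F9A0" for b in ch.encode("utf-8")]
only_high_bytes = all(b >= 0x80 for b in non_ascii_bytes)

# Searching the raw UTF-8 bytes for ASCII markup therefore works ...
found_in_utf8 = utf8.find(b"</body>")

# ... but the same byte-level search fails on UTF-16, where "<" is 0x3C 0x00.
found_in_utf16 = utf16.find(b"</body>")
```

The search succeeds on the UTF-8 bytes only because of that property of UTF-8;
with UTF-16 the data would have to be decoded first.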



Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Sam Whited
To try and show why I'm pushing back on this so hard, here is an example
of doing this three different ways: one assuming the references are
bytes, two assuming the references are code points.

https://play.golang.org/p/kKbr2hXd56U

I was forgetting that I can do the third one, and it looks quite nice (if
we ignore the performance cost, as people seem to want to do), but we can't
do any error handling, for reasons explained in the comments. If we're a
client this may not matter: it's not the end of the world if we show the
user a reference that starts or ends with an ugly error character box or
something. If we're the server this might matter more. Either way, I
think having a sane way to do error handling on bad references is a
requirement.

Of course, this is Go-specific, but the solutions probably look similar
in other C-like languages. I should also note that this is using a
higher-level decoding API than the one I am using, but it doesn't matter,
since the extra boilerplate required to do this at the lower level, where
you get byte slices out, would look the same for the first two examples.
However, it would require extra work for me to do the third example
(because it would give me []byte, not a string), which makes it even less
practical, and the third example isn't a convenience that exists in e.g.
C, so generally it's worth just ignoring.

If I'm having to pick between the code in the first and second example,
please let me pick the first.

—Sam
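
For readers who cannot open the playground link, here is a rough sketch of the
kind of error handling that byte offsets allow. It is written in Python, not
Go, it is not the code from the link above, and both function names are
hypothetical. It assumes the data is already known to be valid UTF-8.

```python
def is_char_boundary(data: bytes, offset: int) -> bool:
    """True if offset does not point into the middle of a UTF-8 sequence."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True
    # Continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def slice_reference(data: bytes, begin: int, end: int) -> str:
    """Return the referenced text, rejecting offsets that split a character."""
    if begin > end:
        raise ValueError("begin is after end")
    if not (is_char_boundary(data, begin) and is_char_boundary(data, end)):
        raise ValueError("reference does not fall on character boundaries")
    return data[begin:end].decode("utf-8")
```

Each boundary check looks at a single byte, so a bad reference can be rejected
without decoding the rest of the body.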

On Tue, Dec 8, 2020, at 22:13, Sam Whited wrote:
> The XML library I use does not give me a string or slice of code
> points, it gives me a slice of bytes because that's the level I'm
> operating at. Even at the higher level if I decode the bytes into a
> string (A Go string in this case), that is still just a slice of UTF-8
> bytes (it does not decode them, ensure they're valid, and turn them
> into a slice of code points, that is a very expensive operation that
> it avoids until you need it or explicitly do it yourself).
>
> I don't understand how this is part of the XML data model. Do you mean
> that only Unicode encodings are supported by XML? If so, that's fair
> and removes one of my arguments, I did not know that was the case.
> However, I still think the data on the wire should describe the other
> data on the wire, not some higher-level "decoded" representation that
> many XML libraries may not even use.
>
> —Sam
>
> On Tue, Dec 8, 2020, at 21:32, Jonas Schäfer wrote:
> > But all implementations which want to be XMPP and XML 1.0 compliant
> > need to have some way to convert or offer access to code points, as
> > that’s the XML data model. Let’s build on that.
> >
> > Easy choice.
> >
> > Much easier than writing 20 emails on this topic, and that just in
> > this thread.

-- 
Sam Whited


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Marvin W
Hi,

On 09.12.20 08:59, Florian Schmaus wrote:
> But the recipient would be able to apply the same rules regarding
> localization as the sender when counting grapheme clusters.

Which rules? Unicode does not provide a locale-specific grapheme
clustering algorithm. TR29 only mentions that such algorithms exist and
that it provides a "default" algorithm that can be extended with
locale-specific rules. AFAIK there is no standard that properly defines
grapheme clustering other than the TR29 algorithm, which explicitly
states that it does not produce proper locale-specific grapheme clusters.
The only thing we can do is say "do what TR29 says" (it actually gives two
options, but let's just stick with extended grapheme clusters). However,
TR29 itself does not make any statements regarding its stability, and
Unicode updates in recent years did change TR29 behavior even for
existing codepoints. Thus if we rely on the TR29 algorithm we need to
specify a version of it, which in general is a bad idea.

> I also suggest that the receiving side is considered. For example:
> "Entities that receive character counted text should normalize the
> counted text to Unicode Normalization Form C (NFC) [1] prior to
> evaluating the character indexes."

As I mentioned earlier, normalizing is changing the codepoints and thus
(in XML layer) changing the transferred content. In my tests, I haven't
seen any current server implementation doing that. In the worst case,
normalizing can make messages unreadable to a receiving client that
would otherwise have been able to read them (if the server has a
newer Unicode version than both clients' fonts). So instead of adding
client side behavior to handle servers doing modifications, I'd rather
codify that servers SHOULD NOT modify the codepoints in . Where we
put this rule is another question.

In my draft I specifically had the rule that if an entity applies
normalization they have to update the indices if needed. This also
applies to receiving entities which is incompatible with what you wrote
(or at least I understand that you want to normalize without updating
indices).

Here is the rationale behind that:
Normalization as per TR15 is considered stable, which means that as long
as you only use codepoints that are defined in the Unicode version your
code uses, any future Unicode/TR15 version will consider the string
normalized. In other terms, this means that to ensure your client only
sends normalized strings (which you would need to, so that any other
entity can apply normalization without changing indices), you'd have to
restrict your client to only send codepoints that are defined in the
Unicode version it supports.
However in practice, users have been sending codepoints that are not
part of the Unicode specification implemented in their clients. This is
because you can practically use new emojis (and their codepoints) as
soon as they appear in popular fonts.

Just to make an example: To support latest Emojis in Android apps, you
can use the "EmojiCompat" support library (that includes a font with all
emojis of the latest version) and thereby become able to display them.
However, the supported Unicode version for all text processing still
remains the version implemented by the ICU4J version shipped with the
operating system. About 60% of Android devices currently in use have
Android 9 or earlier and thus implement Unicode 10.0 or earlier (which
was released mid 2017). Thus 60% of Android devices would not be able to
correctly normalize messages that include the 🦠 microbe emoji. Thus, in
practice, sending clients cannot guarantee to send normalized strings
without severely harming user experience by not accepting new
codepoints. This also means that receiving clients cannot rely on
receiving normalized messages or messages where indices refer to
normalized messages.

Marvin
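
To make the index problem concrete, here is a small sketch in Python using the
standard unicodedata module. The sample text and the helper has_unassigned are
made up for illustration.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
decomposed = "Cafe\u0301 menu"
composed = unicodedata.normalize("NFC", decomposed)

# Code point index of "menu" before and after normalization.
begin_before = decomposed.index("menu")  # counted on the text as sent
begin_after = composed.index("menu")     # counted on the NFC text


def has_unassigned(text: str) -> bool:
    """True if this Unicode database reports a code point as unassigned (Cn)."""
    return any(unicodedata.category(ch) == "Cn" for ch in text)


# Unicode version implemented by this Python build; code points assigned in
# later versions are reported as unassigned by has_unassigned().
db_version = unicodedata.unidata_version
```

An entity that normalizes the text without updating the indices would point
one code point too far in this example. A check like has_unassigned is one way
a client could notice text that its own Unicode version cannot reliably
normalize.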


Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

2020-12-09 Thread Florian Schmaus
On 12/7/20 11:34 PM, Marvin W wrote:
> On 07.12.20 19:34, Florian Schmaus wrote:
>> We do have xml:lang, don't we?
>
> Unfortunately, it doesn't help in all cases. It's perfectly fine to write
> a message with xml:lang="en":
>
> "chlapec" is "boy" in Slovak
>
> This is 27 grapheme clusters, but I guess most western people would
> count it as 28.

But the recipient would be able to apply the same rules regarding
localization as the sender when counting grapheme clusters.

>> Let us ignore grapheme clusters for a moment and focus on XEP-0426:
>> Have you considered Unicode normalization? Especially when a text
>> that was originally in decomposed form is normalized to composed
>> form. This would corrupt the code point indexes.
>>
>> [..]
>>
>> I think that due to this, XEP-0426 should specify that counting
>> happens with the text in NFC form. Or am I missing something?
>
> I could imagine going for something like:

Yes, that definitely goes in the right direction.

> Receiving or intermediary entities SHOULD NOT apply Unicode
> normalization to the text referenced from character counting.

I am not sure that you can (or that we should) put normative text that
applies to intermediate hops into XEP-0426. The XEP could/should limit
itself to describing normative clauses for the end-points exchanging
character counting data.

> If entities apply Unicode normalization, they SHOULD update all
> positions, indices and lengths derived from character counting if
> required.

As above. I think this would need at least a discoverable disco#info
feature. But even then, I doubt that this is useful in a normative form.
However, it probably cannot hurt to have XEP-0426 spell this out as a
recommendation in an informative way.

> It is RECOMMENDED that entities creating the original
> stanzas use NFC form.

Now that is the part I really like and which I believe to be missing
from XEP-0426. +1

I also suggest that the receiving side is considered. For example:
"Entities that receive character counted text should normalize the
counted text to Unicode Normalization Form C (NFC) [1] prior to
evaluating the character indexes."


1: https://unicode.org/reports/tr15/

- Florian
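
The two counts in the Slovak example quoted above can be reproduced with a
short Python sketch. count_with_ch_digraph is a made-up helper that hard-codes
a single locale rule ("ch" counts as one unit); it is not the TR29 algorithm
and is only meant to show where the difference of one comes from.

```python
example = '"chlapec" is "boy" in Slovak'

# Number of code points; the text is pure ASCII, so this is also the
# number of UTF-8 bytes.
code_points = len(example)


def count_with_ch_digraph(text: str) -> int:
    """Toy locale rule: count each "ch" as one unit, everything else per code point."""
    return len(text) - text.lower().count("ch")


tailored = count_with_ch_digraph(example)
```

Sender and recipient only arrive at the same number if both apply the same
rule, which is the assumption under discussion in this message.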


