Ivan, strings containing unpaired surrogate characters are serialized and deserialized by the Java UTF-8 encoder and decoder without errors, but the result does not match the original string. As a consequence, if a user's login contains such characters, it will be distorted after deserialization and the user will not be able to log in. I understand that this is quite a rare case. In any case, a way to solve this problem was introduced here - https://issues.apache.org/jira/browse/IGNITE-3098
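
To illustrate the distortion, here is a minimal sketch using only the JDK (the class name and the login value are hypothetical, not taken from Ignite). As far as I know, `String#getBytes(Charset)` silently replaces a lone surrogate with the charset's replacement byte instead of failing:

```java
import java.nio.charset.StandardCharsets;

final class UnpairedSurrogateRoundTrip {
    /** Encodes the string to UTF-8 and decodes it back, using the JDK's default replacement behavior. */
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical login that ends with a lone high surrogate (U+D800).
        String login = "user\uD800";
        String restored = roundTrip(login);

        // The round trip completes without an exception, but the strings differ.
        System.out.println(login.equals(restored)); // prints "false"
    }
}
```

A properly paired surrogate (for example `"\uD83D\uDE00"`) survives the same round trip unchanged, so only the unpaired case is affected.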

Frankly, this is not the topic I would like to discuss now. The main question is: should we reject the join of nodes with different encodings, or just fix all places where the implicit default encoding is used and specify an explicit one, as Ivan Daschinsky suggested?

From my point of view, it is better to reject nodes with different encodings (especially since Ilya Kasnacheev mentioned that we already have the warning "Differing character encodings across cluster may lead to erratic behavior"). This would help to avoid the "erratic behavior" rather than just warn about it. It is important because problems related to string encoding can occur in different components, and their cause is not always obvious.
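
A rough sketch of what such a join check could look like, assuming each node reports the name of its default charset to the node that validates the join. The class, the method and the error text are hypothetical and are not existing Ignite API:

```java
import java.nio.charset.Charset;
import java.util.Optional;

final class EncodingJoinCheck {
    /**
     * Compares the local and the joining node's charset names.
     * Returns an error message if they differ, or an empty Optional if the join may proceed.
     */
    static Optional<String> validate(String localEncoding, String joiningEncoding) {
        boolean same;
        try {
            // Charset.forName resolves aliases, so "utf8" and "UTF-8" are treated as equal.
            same = Charset.forName(localEncoding).equals(Charset.forName(joiningEncoding));
        } catch (IllegalArgumentException e) {
            // Unknown or illegal charset name on this JVM: fall back to a plain name comparison.
            same = localEncoding.equalsIgnoreCase(joiningEncoding);
        }
        return same
            ? Optional.empty()
            : Optional.of("Joining node encoding differs from the local one [local="
                + localEncoding + ", joining=" + joiningEncoding + ']');
    }

    public static void main(String[] args) {
        String local = Charset.defaultCharset().name();
        // Hypothetical value received from the joining node.
        System.out.println(validate(local, "ISO-8859-1").orElse("join allowed"));
    }
}
```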

WDYT?

On 13.12.2021 20:01, Ivan Pavlukhin wrote:
I guess Nikolay is talking about the problem with UTF-8 in case string contains unpaired surrogate symbols

Folks, give me a clue why it is a problem? Naively it seems to be a good restriction rather than a problem. What problems can it cause in practice?

2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev <ilya.kasnach...@gmail.com>:
Hello!

We already have a warning about this, see IgniteKernal.checkFileEncoding()

Regards,
--
Ilya Kasnacheev


Mon, 13 Dec 2021 at 16:26, Ivan Daschinsky <ivanda...@gmail.com>:

But now multiple components independently serialize strings for their needs and use the default encoding for this. For example, DirectByteBufferStreamImplV2#writeString, MetaStorage#writeRaw and so on.
We should fix all of them.

BinaryUtils#utf8BytesToStr
Let's use this everywhere.

As for me, I expect far more problems from enforcing a rule that fails the join than from enforcing that all components use UTF-8.
Some weird cases (surrogate pairs) we can simply not consider at all (I strongly believe that is OK).
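
A minimal sketch of the "explicit UTF-8 everywhere" approach, using plain JDK calls. The helper names are hypothetical; this is not the actual code of BinaryUtils or of the components mentioned above:

```java
import java.nio.charset.StandardCharsets;

final class ExplicitUtf8 {
    /** Encodes with an explicit charset instead of relying on the JVM default (String#getBytes()). */
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes with the same explicit charset instead of new String(byte[]). */
    static String fromBytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442"; // a Cyrillic word
        // The result no longer depends on file.encoding of the sending or receiving JVM.
        System.out.println(original.equals(fromBytes(toBytes(original)))); // prints "true"
    }
}
```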

Mon, 13 Dec 2021 at 15:15, Nikolay Izhikov <nizhi...@apache.org>:

Does Java String support all unicode characters and particularly does it support more characters than UTF-8

It’s not about Java, it’s about UTF-8 standard.

Please, take a look at [1]

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

And [2]

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.
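
To make the restriction concrete, here is a small sketch that checks whether a Java string can be encoded as well-formed UTF-8 by using a JDK encoder configured to report errors instead of substituting a replacement. The class and method names are hypothetical:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    /** Returns true if the string contains no unpaired surrogates, i.e. it can be encoded as UTF-8. */
    static boolean isEncodable(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for a lone high or low surrogate (U+D800..U+DFFF).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isEncodable("plain text"));   // prints "true"
        System.out.println(isEncodable("\uD83D\uDE00")); // valid surrogate pair, prints "true"
        System.out.println(isEncodable("\uD800"));       // lone high surrogate, prints "false"
    }
}
```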

Actually, we already have some modes to support this restriction of UTF-8.
Please, take a look at BinaryUtils#utf8BytesToStr [3]


[1] https://en.wikipedia.org/wiki/UTF-8
[2] https://datatracker.ietf.org/doc/html/rfc3629
[3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387

On 13 Dec 2021, at 13:57, Ivan Pavlukhin <vololo...@gmail.com> wrote:
UTF-8 can’t encode all UNICODE characters.
Nikolay, could you please elaborate? My understanding is that the encoding we speak about matters for conversion from byte arrays to strings. Does Java String support all unicode characters and particularly does it support more characters than UTF-8 (I am not saying here that java String uses UTF-8)?

2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
UTF-8 is already the default encoding in our BinaryObject format. So... I am for unification.

Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov <nizhi...@apache.org>:

Hello, Ivan.

UTF-8 can’t encode all UNICODE characters.

On 13 Dec 2021, at 12:49, Ivan Daschinsky <ivanda...@gmail.com> wrote:
Hm, maybe a better option is to enforce that all strings are encoded in UTF-8?
AFAIK a multi-OS cluster is quite a common case.


Mon, 13 Dec 2021 at 11:36, Mikhail Petrov <pmgheap....@gmail.com>:
Igniters,

Recently we faced the problem that many issues arise if the cluster consists of nodes running in JVMs with different encodings. The root cause of these issues is components that use `String#getBytes()` and `new String(<byte array>)`, which rely on the system default encoding. Thus, if a string is deserialized on a node with a different encoding from the one that serialized it, the deserialized string can differ from the original one.
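
The effect can be reproduced with a short sketch that passes the two "default" charsets explicitly, so it behaves the same on any JVM. The class name and the choice of charsets are assumptions made for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MismatchedCharsets {
    /** Simulates a sender that encodes with one default charset and a receiver that decodes with another. */
    static String transfer(String s, Charset senderDefault, Charset receiverDefault) {
        byte[] wire = s.getBytes(senderDefault);  // what String#getBytes() does on the sender
        return new String(wire, receiverDefault); // what new String(byte[]) does on the receiver
    }

    public static void main(String[] args) {
        String ascii = "node-1";
        String accented = "caf\u00E9";

        // ASCII-only strings survive, which is why the problem is easy to miss in tests.
        System.out.println(ascii.equals(
            transfer(ascii, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));    // prints "true"

        // Non-ASCII characters are distorted when the charsets differ.
        System.out.println(accented.equals(
            transfer(accented, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8))); // prints "false"
    }
}
```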

For example:

Serialization/deserialization of strings in communication messages may be broken for some strings on nodes running in a JVM with a different encoding, as DirectByteBufferStreamImplV2 uses String#getBytes() to serialize strings - [1]

Or the IgniteAuthenticationProcessor can compute different security IDs for the user on different nodes in this case - [2]
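
A hypothetical illustration of how an ID derived from login bytes can diverge between nodes. This is only a sketch of the general mechanism; it assumes the ID is a name-based UUID over the encoded login and does not reproduce the actual IgniteAuthenticationProcessor code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class SecurityIdSketch {
    /** Hypothetical ID derivation: a name-based UUID over the login bytes in the node's charset. */
    static UUID idFor(String login, Charset nodeDefault) {
        return UUID.nameUUIDFromBytes(login.getBytes(nodeDefault));
    }

    public static void main(String[] args) {
        String login = "J\u00FCrgen"; // hypothetical login with a non-ASCII character

        UUID onNodeA = idFor(login, StandardCharsets.UTF_8);
        UUID onNodeB = idFor(login, StandardCharsets.ISO_8859_1);

        // The same login yields different IDs because the underlying bytes differ.
        System.out.println(onNodeA.equals(onNodeB)); // prints "false"
    }
}
```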

What do you think about solving this problem globally by rejecting the join of nodes that run on JVMs with different encodings?

As a result, we will be sure that all cluster nodes have the same encoding and all related problems will be solved.

[1] - https://issues.apache.org/jira/browse/IGNITE-16106
[2] - https://issues.apache.org/jira/browse/IGNITE-16068

--
Mikhail


--
Sincerely yours, Ivan Daschinskiy

--
Sincerely yours, Ivan Daschinskiy


--

Best regards,
Ivan Pavlukhin

--
Sincerely yours, Ivan Daschinskiy

