Re: [DISCUSSION] Reject join of nodes with different character encodings

Mikhail Petrov Mon, 13 Dec 2021 10:30:15 -0800

Ivan, strings with unpaired surrogate symbols are serialized and deserialized by the Java UTF-8 codec without errors, but the result does not match the initial string. As a result, if a user's login contains such symbols, it will be distorted after deserialization and the user will not be able to log in. I understand that this is quite a rare case. Anyway, a way to solve this problem was introduced here - https://issues.apache.org/jira/browse/IGNITE-3098
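
To make the round-trip issue concrete, here is a minimal sketch that uses only the JDK. The class and method names are made up for illustration and are not Ignite code. It assumes the string is encoded with String#getBytes(StandardCharsets.UTF_8), which substitutes a replacement byte for an unpaired surrogate instead of throwing:

```java
import java.nio.charset.StandardCharsets;

final class SurrogateRoundTrip {
    /** Encodes the string to UTF-8 with the JDK encoder and decodes it back. */
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical login that ends with a lone high surrogate (U+D800).
        String login = "user\uD800";

        // A well-formed surrogate pair (U+1F600) for comparison.
        String valid = "user\uD83D\uDE00";

        System.out.println("lone surrogate survives: " + login.equals(roundTrip(login)));
        System.out.println("valid pair survives: " + valid.equals(roundTrip(valid)));
    }
}
```

No exception is thrown in either case, so the distortion of the first string goes unnoticed unless the result is compared with the original.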

Frankly, it is not the topic I would like to discuss now. The main question is: should we restrict the join of nodes with different encodings, or just fix all places where the implicit default encoding is used and specify an explicit one, as Ivan Daschinsky suggested?

From my point of view, it is better to reject nodes with different encodings (especially after Ilya Kasnacheev mentioned that we already have a warning "Differing character encodings across cluster may lead to erratic behavior"). It will help to avoid "erratic behavior", not just warn about it. This is important because problems related to string encoding can occur in different components and their cause is not always obvious.
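
A rough sketch of what such a join check could look like is below. It is only an illustration under assumptions: EncodingJoinCheck and validate are hypothetical names, and how the joining node's charset name reaches the coordinator is left out. Names are resolved through Charset.forName so that aliases of the same charset do not cause a false rejection (it throws for names the JVM does not support).

```java
import java.nio.charset.Charset;
import java.util.Optional;

final class EncodingJoinCheck {
    /**
     * Returns an error message if the joining node's default charset differs
     * from the cluster's, or an empty Optional if the join may proceed.
     */
    static Optional<String> validate(String clusterCharsetName, String joiningCharsetName) {
        // Charset#equals compares canonical names, so "UTF8" and "UTF-8" match.
        Charset cluster = Charset.forName(clusterCharsetName);
        Charset joining = Charset.forName(joiningCharsetName);

        if (cluster.equals(joining))
            return Optional.empty();

        return Optional.of("Joining node charset " + joining.name()
            + " differs from cluster charset " + cluster.name());
    }

    public static void main(String[] args) {
        // The value a joining node would report about itself.
        String local = Charset.defaultCharset().name();

        System.out.println(validate("UTF-8", local).orElse("join allowed"));
    }
}
```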


WDYT?

On 13.12.2021 20:01, Ivan Pavlukhin wrote:

I guess Nikolay is talking about the problem with UTF-8 in case string contains 
unpaired surrogate symbols

Folks, give me a clue why it is a problem? Naively it seems to be a
good restriction rather than problem. What problems can it cause in
practice?

2021-12-13 16:32 GMT+03:00, Ilya Kasnacheev <[email protected]>:

Hello!

We already have a warning about this, see IgniteKernal.checkFileEncoding()

Regards,
--
Ilya Kasnacheev


Mon, 13 Dec 2021 at 16:26, Ivan Daschinsky <[email protected]>:

But now multiple components independently serialize strings for their
needs and use default encoding for this.
For example DirectByteBufferStreamImplV2#writeString,
MetaStorage#writeRaw and so on

We should fix all of them.

BinaryUtils#utf8BytesToStr

Let's use this everywhere.

As for me, I expect far more problems from enforcing a rule that fails
the join than from making all components use UTF-8.
Some weird cases (surrogate pairs) we can simply not consider at all
(I strongly believe that is OK).

Mon, 13 Dec 2021 at 15:15, Nikolay Izhikov <[email protected]>:

Does Java String support all unicode characters and particularly does
it support more characters than UTF-8

It’s not about Java, it’s about the UTF-8 standard.

Please, take a look at [1]

In November 2003, UTF-8 was restricted by RFC 3629 to match the
constraints of the UTF-16 character encoding: explicitly prohibiting
code points corresponding to the high and low surrogate characters
removed more than 3% of the three-byte sequences, and ending at
U+10FFFF removed more than 48% of the four-byte sequences and all
five- and six-byte sequences.

And [2]

The definition of UTF-8 prohibits encoding character numbers between
U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding
form (as surrogate pairs) and do not directly represent characters.

Actually, we already have some modes to support this restriction of
UTF-8.
Please, take a look at BinaryUtils#utf8BytesToStr [3]


[1] https://en.wikipedia.org/wiki/UTF-8
[2] https://datatracker.ietf.org/doc/html/rfc3629
[3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387

On 13 Dec 2021, at 13:57, Ivan Pavlukhin <[email protected]> wrote:

UTF-8 can’t encode all UNICODE characters.

Nikolay, could you please elaborate? My understanding is that the
encoding we speak about matters for conversion from byte arrays to
strings. Does Java String support all unicode characters and
particularly does it support more characters than UTF-8 (I am not
saying here that java String uses UTF-8)?

2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <[email protected]>:

UTF-8 is already a default encoding in our BinaryObject format.
So.... I am for unification.

Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov <[email protected]>:

Hello, Ivan.

UTF-8 can’t encode all UNICODE characters.

On 13 Dec 2021, at 12:49, Ivan Daschinsky <[email protected]> wrote:

Hm, maybe a better option is to enforce that all strings are encoded
in UTF-8?
AFAIK a multi-OS cluster is quite a common case.


Mon, 13 Dec 2021 at 11:36, Mikhail Petrov <[email protected]>:

Igniters,

Recently we faced the problem that if the cluster consists of nodes
running in JVMs with different encodings, many issues arise.
The root cause of the mentioned issues is components that use
`String#getBytes()` and `new String(<byte array>)`, which rely on the
system default encoding. Thus, if a string is deserialized on a node
with a different encoding from the one that serialized it, the
deserialized string can be different from the original one.
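
A small JDK-only sketch of this effect follows. It simulates the two nodes by passing explicit charsets in place of each JVM's default; the class name and the sample string are illustrative assumptions, not code from Ignite:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetMismatch {
    /** Serializes with the sender's default charset and deserializes with the receiver's. */
    static String transfer(String s, Charset senderDflt, Charset receiverDflt) {
        byte[] wire = s.getBytes(senderDflt);  // stands in for s.getBytes() on the sender

        return new String(wire, receiverDflt); // stands in for new String(bytes) on the receiver
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "cafe" with an acute accent on the last letter

        String sameEnc = transfer(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixedEnc = transfer(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println("same encodings match: " + original.equals(sameEnc));
        System.out.println("mixed encodings match: " + original.equals(mixedEnc));
    }
}
```

Pure ASCII strings pass through unchanged in both cases, which is one reason such problems can stay hidden for a long time.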

For example:

Serialization/deserialization of string in communication messages may
be broken for some strings on nodes running in a JVM with a different
encoding as DirectByteBufferStreamImplV2 uses String#getBytes() to
serialize strings - [1]

Or the IgniteAuthenticationProcessor can compute different security
IDs for the user on different nodes in this case - [2]
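
The security ID case can be sketched as follows, under the assumption (not verified against the Ignite sources here) that the ID is a name-based UUID computed over the login bytes. LoginIdSketch and idFor are hypothetical names, and the charset parameter stands in for each node's default encoding:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

final class LoginIdSketch {
    /** Hypothetical ID derivation: a name-based UUID over the login bytes in the node's charset. */
    static UUID idFor(String login, Charset nodeDflt) {
        return UUID.nameUUIDFromBytes(login.getBytes(nodeDflt));
    }

    public static void main(String[] args) {
        String login = "Jos\u00E9"; // non-ASCII login, encoded differently by the two charsets

        UUID onUtf8Node = idFor(login, StandardCharsets.UTF_8);
        UUID onLatin1Node = idFor(login, StandardCharsets.ISO_8859_1);

        System.out.println("IDs match: " + onUtf8Node.equals(onLatin1Node));
    }
}
```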

What do you think, if we solve this problem globally, by rejecting the
join of nodes that run on JVMs with different encodings?

As a result, we will be sure that all cluster nodes have the same
encoding and all related problems will be solved.

[1] - https://issues.apache.org/jira/browse/IGNITE-16106
[2] - https://issues.apache.org/jira/browse/IGNITE-16068

--
Mikhail

--
Sincerely yours, Ivan Daschinskiy


--

Best regards,
Ivan Pavlukhin

