Re: [DISCUSSION] Reject join of nodes with different character encodings

Nikolay Izhikov Mon, 13 Dec 2021 04:15:22 -0800

> Does Java String support all unicode characters and particularly does it 
> support more characters than UTF-8


It’s not about Java, it’s about the UTF-8 standard.

Please, take a look at [1] 

> In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints 
> of the UTF-16 character encoding: explicitly prohibiting code points 
> corresponding to the high and low surrogate characters removed more than 3% 
> of the three-byte sequences, and ending at U+10FFFF removed more than 48% of 
> the four-byte sequences and all five- and six-byte sequences.

And [2] 

> The definition of UTF-8 prohibits encoding character numbers between U+D800 
> and U+DFFF, which are reserved for use with the UTF-16 encoding form (as 
> surrogate pairs) and do not directly represent characters.

Actually, we already have some modes to support this restriction of UTF-8.
Please take a look at BinaryUtils#utf8BytesToStr [3]
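
To illustrate the restriction quoted above, here is a minimal sketch using only the JDK. It is not Ignite code, and the class and method names are hypothetical. It assumes the standard behaviour of String#getBytes(Charset), which substitutes the charset's replacement bytes for input it cannot encode. A Java String may hold an unpaired surrogate, but UTF-8 cannot represent it, so the round trip loses it:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ignite.
public class LoneSurrogateDemo {
    /** Encodes the string to UTF-8 bytes and decodes it back. */
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unpaired high surrogate: legal in a Java String, not encodable in UTF-8.
        String lone = String.valueOf((char) 0xD800);
        // U+1F600 as a proper surrogate pair: encodable as a 4-byte UTF-8 sequence.
        String pair = new String(Character.toChars(0x1F600));

        System.out.println(lone.equals(roundTrip(lone))); // false
        System.out.println(pair.equals(roundTrip(pair))); // true
    }
}
```

So the strings that do not survive UTF-8 are those with unpaired surrogates. Well-formed text, including characters outside the BMP, round-trips unchanged.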


[1] https://en.wikipedia.org/wiki/UTF-8
[2] https://datatracker.ietf.org/doc/html/rfc3629
[3] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/binary/BinaryUtils.java#L2387

> On 13 Dec 2021, at 13:57, Ivan Pavlukhin <vololo...@gmail.com> wrote:
> 
>> UTF-8 can’t encode all UNICODE characters.
> 
> Nikolay, could you please elaborate? My understanding is that the
> encoding we are talking about matters for conversion from byte arrays
> to strings. Does Java String support all Unicode characters, and in
> particular does it support more characters than UTF-8 (I am not saying
> here that Java String uses UTF-8)?
> 
> 2021-12-13 12:56 GMT+03:00, Ivan Daschinsky <ivanda...@gmail.com>:
>> UTF-8 is already the default encoding in our BinaryObject format. So... I
>> am for unification.
>> 
>> Mon, 13 Dec 2021 at 12:50, Nikolay Izhikov <nizhi...@apache.org>:
>> 
>>> Hello, Ivan.
>>> 
>>> UTF-8 can’t encode all UNICODE characters.
>>> 
>>>> On 13 Dec 2021, at 12:49, Ivan Daschinsky <ivanda...@gmail.com> wrote:
>>>> 
>>>> Hm, maybe a better option is to enforce that all strings are encoded in
>>>> UTF-8?
>>>> AFAIK a multi-OS cluster is quite a common case.
>>>> 
>>>> 
>>>> Mon, 13 Dec 2021 at 11:36, Mikhail Petrov <pmgheap....@gmail.com>:
>>>> 
>>>>> Igniters,
>>>>> 
>>>>> Recently we ran into a problem: if the cluster consists of nodes
>>>>> running in JVMs with different default encodings, many issues arise.
>>>>> The root cause is components that use `String#getBytes()` and
>>>>> `new String(<byte array>)`, both of which rely on the system default
>>>>> encoding. Thus, if a string is deserialized on a node whose encoding
>>>>> differs from that of the node that serialized it, the deserialized
>>>>> string can differ from the original.
>>>>> 
>>>>> For example:
>>>>> 
>>>>> Serialization/deserialization of strings in communication messages
>>>>> may be broken for some strings on nodes running in JVMs with
>>>>> different encodings, as DirectByteBufferStreamImplV2 uses
>>>>> String#getBytes() to serialize strings - [1]
>>>>> 
>>>>> Or the IgniteAuthenticationProcessor can compute different security
>>>>> IDs for the user on different nodes in this case - [2]
>>>>> 
>>>>> What do you think about solving this problem globally by rejecting
>>>>> the join of nodes that run on JVMs with different encodings?
>>>>> 
>>>>> As a result, we will be sure that all cluster nodes have the same
>>>>> encoding and all related problems will be solved.
>>>>> 
>>>>> [1] - https://issues.apache.org/jira/browse/IGNITE-16106
>>>>> [2] - https://issues.apache.org/jira/browse/IGNITE-16068
>>>>> 
>>>>> --
>>>>> Mikhail
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> Sincerely yours, Ivan Daschinskiy
>>> 
>>> 
>> 
>> --
>> Sincerely yours, Ivan Daschinskiy
>> 
> 
> 
> -- 
> 
> Best regards,
> Ivan Pavlukhin
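
For reference, the default-encoding mismatch described in the original post at the bottom of this thread can be sketched with the JDK alone. This is not Ignite code, and the class and method names are hypothetical. Two explicit charsets stand in for the differing default encodings of a sending and a receiving node (an assumption made so the sketch behaves the same on any machine):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Ignite.
public class CharsetMismatchDemo {
    /** Encodes with the sender's charset and decodes with the receiver's. */
    static String transfer(String s, Charset sender, Charset receiver) {
        byte[] wire = s.getBytes(sender);  // stands in for String#getBytes()
        return new String(wire, receiver); // stands in for new String(byte[])
    }

    public static void main(String[] args) {
        String original = "caf\u00E9"; // "café", written with an escape

        String same = transfer(original,
            StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mixed = transfer(original,
            StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(same));  // true
        System.out.println(original.equals(mixed)); // false
    }
}
```

Passing an explicit Charset on both sides, as in the first call, keeps the result independent of the JVM default.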
